2017 Edition
Risk analysis is part of every decision we make when faced with uncertainty, ambiguity, and variability. Indeed, even though we have unprecedented access to information, we cannot accurately predict the future. In finance, a fair amount of uncertainty and risk is involved in estimating the future value of financial products, due to the wide variety of potential outcomes. Monte Carlo simulation (also known as the Monte Carlo method) allows inspecting many possible outcomes of the decision-making process and can be used to assess the impact of risk: this, in turn, allows for better decision-making under uncertainty.
The main objectives we set for this Notebook are as follows:
This Notebook is inspired by Chapter 9 of the book Advanced Analytics with Spark by Josh Wills, Sandy Ryza, Sean Owen, and Uri Laserson. It is strongly suggested to read this Chapter to get a general idea of the topic of this Notebook.
Monte Carlo simulation is a computerized mathematical technique that makes it possible to account for risk in quantitative analysis and decision making. This technique is used in many different fields, such as R&D, risk management, portfolio management, derivative pricing, strategic planning, project planning, cost modeling, and many more.
In general, MCS is a technique that models the uncertainty of a model's input variables with probability distributions. By randomly selecting values from these distributions, it recomputes the simulated model many times, thereby building up the probability distribution of the output.
Historically, this technique was first used by scientists working on the atomic bomb: it was named after Monte Carlo, the Monaco resort town renowned for its casinos. Since its introduction in World War II, Monte Carlo simulation has been used to model a variety of physical and conceptual systems.
Monte Carlo simulation performs risk analysis by building models of possible results: each uncertain input is replaced by a range of possible values described by a probability distribution. It then computes possible outcomes repeatedly, each time using a different set of random values drawn from the probability functions that "model" the input. Depending upon the number of random input variables and their distributions, a Monte Carlo simulation could involve thousands or tens of thousands of "rounds" before it is complete. When complete, Monte Carlo simulation produces distributions of possible outcome values.
By using probability distributions instead of single point estimates as inputs, it is possible to model uncertainty more accurately: different choices of distributions will yield different outputs.
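To make this concrete, here is a minimal sketch of the general recipe (the toy model and its input distributions are hypothetical, chosen only for illustration): draw random inputs from assumed distributions, push them through the model, repeat many times, and inspect the resulting output distribution.

```python
import random

def model(x, y):
    # a toy deterministic model of two uncertain inputs
    return x * y

random.seed(42)
outputs = []
for _ in range(10000):
    x = random.uniform(0.5, 1.5)   # assumed input distribution
    y = random.gauss(10.0, 2.0)    # assumed input distribution
    outputs.append(model(x, y))

mean_output = sum(outputs) / len(outputs)
print(mean_output)  # close to 1.0 * 10.0 = 10.0
```

The collected `outputs` list is exactly the kind of output distribution that Monte Carlo simulation produces; its histogram, mean, and percentiles summarize the risk.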
Imagine you are the marketing manager for a firm that is planning to introduce a new product. You need to estimate the first-year net profit from this product, which might depend on:
Net profit will be calculated as $Net Profit = Sales Volume* (Selling Price - Unit cost) - Fixed costs$. Fixed costs (accounting for various overheads, advertising budget, etc.) are known to be \$ 120,000, which we assume to be deterministic. All other factors, instead, involve some uncertainty: sales volume (in units) can cover quite a large range, the selling price per unit will depend on competitor actions, which are hard to predict, and unit costs will also vary depending on vendor prices and production experience, for example.
Now, to build a risk analysis model, we must first identify the uncertain variables -- which are essentially random variables. While there's some uncertainty in almost all variables in a business model, we want to focus on variables where the range of values is significant.
Based on hypothetical market research, you believe that there are equal chances for the market to be slow, normal, or hot:

average_unit = (50000 + 75000 + 100000) / 3
average_price = (11.00 + 10.00 + 8.00) / 3
print("average unit:", average_unit)
print("average price:", average_price)
Another uncertain variable is Unit Cost. In our illustrative example, we assume that your firm's production manager advises you that unit costs may be anywhere from \$5.50 to \$7.50, with a most likely expected cost of \$6.50. In this case, the most likely cost can be considered as the average cost.
Our next step is to identify uncertain functions -- also called functions of a random variable. Recall that Net Profit is calculated as $Net Profit = Sales Volume * (Selling Price - Unit cost) - Fixed costs$. However, Sales Volume, Selling Price and Unit Cost are all uncertain variables, so Net Profit is an uncertain function.
The simplest model for predicting the Net Profit uses the average sales volume, the average selling price, and the average unit cost. Considering only averages, we get $Net Profit = 75,000*(9.67 - 6.5) - 120,000 \approx 117,500$.
However, as Dr. Sam Savage warns, "Plans based on average assumptions will be wrong on average." The calculated result is far from the actual value: indeed, the true average Net Profit is roughly \$93,000, as we will see later in the example.

def calNetProfit(average_unit, average_price, average_unitcost, fixed_cost):
    return average_unit * (average_price - average_unitcost) - fixed_cost

average_unitcost = 6.5
fixed_cost = 120000
NetProfit = calNetProfit(average_unit, average_price, average_unitcost, fixed_cost)
print("Net profit:", NetProfit)
average_price_v2 = (11.00*50000 + 10.00*75000 + 8.00*100000) / (3*average_unit)
print("average unit:", average_unit)
print("average price (weighted):", average_price_v2)
NetProfit = calNetProfit(average_unit, average_price_v2, average_unitcost, fixed_cost)
print("Net profit:", NetProfit)
trueNetProfit = 93000
error = (NetProfit - trueNetProfit) / (trueNetProfit)
print("Error in percentage:{0:.3f}%".format(error *100))
As discussed before, the selling price and selling volume both depend on the state of the market scenario (slow/normal/hot). So, the net profit is the result of two random variables: market scenario (which in turn determines sales volumes and selling price) and unit cost.
Now, let's assume (this is an a-priori assumption we make) that market scenario follows a discrete, uniform distribution and that unit cost also follows a uniform distribution. Then, we can compute directly the values for selling price and selling volumes based on the outcome of the random variable market scenario, as shown in Section 2.1.
From these a-priori distributions, in each run (or trial) of our Monte Carlo simulation, we can generate the sample value for each random variable and use it to calculate the Net Profit. The more simulation runs, the more accurate our results will be. For example, if we run the simulation 100,000 times, the average net profit will amount to roughly \$92,600. Every time we run the simulation, a different prediction will be output: the average of such predictions will consistently be less than \$117,500, which we predicted using averages only.
Note also that in this simple example, we generate values for the market scenario and unit cost independently: we consider them to be independent random variables. This means that any (realistic!) correlation between the market scenario and the unit cost is ignored. Later, we will learn how to be more precise and account for dependencies between random variables.

# Get sales volume and price based on market scenario
# the function returns a tuple of (sales_volume, price)
def get_sales_volume_price(scenario):
    # Slow market
    if scenario == 0:
        return (50000, 11)
    # Normal market
    elif scenario == 1:
        return (75000, 10)
    # Hot market
    else:
        return (100000, 8)
Function uniform(a, b) in module random generates a number $a \le c \le b$, drawn from a uniform distribution.
Function randint(a, b) generates an integer $c$ with $a \le c \le b$.
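For example (seeding the generator so the draws are reproducible):

```python
import random

random.seed(0)
c = random.uniform(5.5, 7.5)   # a float with 5.5 <= c <= 7.5
n = random.randint(0, 2)       # an integer in {0, 1, 2}
print(c, n)
```

These are exactly the two calls used below to sample the unit cost and the market scenario.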
import random

samples = list()
total = 0.0
num_simulation = 100000
for i in range(0, num_simulation):
    unit_cost = random.uniform(5.5, 7.5)
    market_scenario = random.randint(0, 2)
    fixed_cost = 120000
    sales_volume, price = get_sales_volume_price(market_scenario)
    netProfit = calNetProfit(sales_volume, price, unit_cost, fixed_cost)
    samples.append(netProfit)
    total = total + netProfit
print("average net profit:", total/num_simulation)
import matplotlib.pyplot as plt

plt.title("Distribution of net profit")
plt.hist(samples, bins=20, density=True);
num_simulation = [10, 100, 1000, 10000, 100000, 1000000, 1500000, 2000000, 2500000, 3000000]
for k in num_simulation:
    total = 0.0
    for i in range(0, k):
        unit_cost = random.uniform(5.5, 7.5)
        market_scenario = random.randint(0, 2)
        fixed_cost = 120000
        sales_volume, price = get_sales_volume_price(market_scenario)
        netProfit = calNetProfit(sales_volume, price, unit_cost, fixed_cost)
        total = total + netProfit
    print("average net profit with %d trials:" % (k), total/k)
In what follows, we summarize the most common probability distributions that are used as a-priori distributions for input random variables:
Normal/Gaussian Distribution: this is a continuous distribution used in situations where the mean and the standard deviation of a given input variable are known, and the mean represents the most probable value of the variable. In other words, values "near" the mean are most likely to occur. It is a symmetric distribution, and its support is unbounded. It is very often used to describe natural phenomena, such as people's heights, inflation rates, energy prices, and so on. An illustration of a normal distribution is given below:
Lognormal Distribution: this is a distribution appropriate for variables taking values in the range $[0, \infty)$. Values are positively skewed, not symmetric like in a normal distribution. Examples of variables described by lognormal distributions include real estate property values, stock prices, and oil reserves. An illustration of a lognormal distribution is given below:
Triangular Distribution: this is a continuous distribution with fixed minimum and maximum values. It is bounded by the minimum and maximum values and can be either symmetrical (the most probable value = mean = median) or asymmetrical. Values around the most likely value (the mode) are more likely to occur. Variables that could be described by a triangular distribution include, for example, past sales history per unit of time and inventory levels. An illustration of a triangular distribution is given below:
Uniform Distribution: this is a continuous distribution bounded by known minimum and maximum values. In contrast to the triangular distribution, the likelihood of occurrence of the values between the minimum and maximum is the same. In other words, all values have an equal chance of occurring, and the distribution is simply characterized by the minimum and maximum values. Examples of variables that can be described by a uniform distribution include manufacturing costs or future sales revenues for a new product. An illustration of the uniform distribution is given below:
Exponential Distribution: this is a continuous distribution used to model the time that passes between independent occurrences, provided that the rate of occurrence is known. An illustration of an exponential distribution is given below:
Discrete Distribution: for this kind of distribution, the "user" defines the specific values that may occur and the likelihood of each of them. An example might be the outcome of a lawsuit: 20% chance of a positive verdict, 30% chance of a negative verdict, 40% chance of a settlement, and 10% chance of a mistrial.
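A discrete distribution like the lawsuit example can be sampled with `random.choices`, which accepts explicit weights (a sketch; the outcome labels are from the example above):

```python
import random

outcomes = ["positive verdict", "negative verdict", "settlement", "mistrial"]
weights = [0.20, 0.30, 0.40, 0.10]

random.seed(1)
# draw 100,000 samples from the discrete distribution
draws = random.choices(outcomes, weights=weights, k=100000)

# empirical frequencies approach the specified probabilities
print(draws.count("settlement") / len(draws))  # roughly 0.4
```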
We hope that by now you have a good understanding about Monte Carlo simulation. Next, we apply this method to a real use case: financial risk estimation.
Imagine that you are an investor on the stock market. You plan to buy some stocks, and you want to estimate the maximum loss you could incur after two weeks of investing. This is the quantity that the financial statistic "Value at Risk" (VaR) seeks to measure. VaR is a measure of investment risk that provides a reasonable estimate of the maximum probable loss in the value of an investment portfolio over a particular time period. A VaR statistic depends on three parameters: a portfolio, a time period, and a confidence level. A VaR of 1 million dollars with a 95% confidence level over two weeks indicates the belief that the portfolio stands only a 5% chance of losing more than 1 million dollars over two weeks. VaR has seen widespread use across financial services organizations. This statistic plays a vital role in determining how much cash investors must hold to meet the credit ratings that they seek. In addition, it is used to understand the risk characteristics of large portfolios: computing the VaR before executing trades helps make informed decisions about investments.
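Given a sample of simulated two-week portfolio returns, the 95% VaR can be estimated empirically as (the negation of) the 5th percentile of the return distribution. Here is a sketch on synthetic normally distributed returns, not our actual portfolio:

```python
import random

random.seed(7)
# synthetic two-week portfolio returns (in dollars), for illustration only
simulated_returns = [random.gauss(0, 1000000) for _ in range(100000)]

# the 5th percentile: only 5% of trials lose more than this amount
simulated_returns.sort()
var_95 = -simulated_returns[int(0.05 * len(simulated_returns))]
print("95% VaR:", var_95)  # close to 1.645 million for this distribution
```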
Our goal is to calculate the two-week VaR at a 95% confidence level, together with the associated confidence interval of the VaR estimate.
In this use case, we will use some terms that might require a proper definition, given the domain. This is what we call the Domain Knowledge.
We have a list of instruments that we plan to invest in. The historical data of each instrument has been collected for you. For simplicity, assume that the returns of instruments at a given time, depend on 4 market factors only:
Our goal is building a model to predict the loss after two weeks' time interval with confidence level set to 95%.
As a side note, it is important to realize that the approach presented in this Notebook is a simplified version of what would happen in a real financial firm. For example, the returns of instruments at a given time often depend on more than 4 market factors! Moreover, the choice of what constitutes an appropriate market factor is an art!
The stock data can be downloaded (or scraped) from Yahoo! by making a series of REST calls. The data includes multiple files. Each file contains the historical information of each instrument that we want to invest in. The data is in the following format (with some samples):
Date, Open, High, Low, Close, Volume, Adj Close
2016-01-22,66.239998,68.07,65.449997,67.860001,137400,67.860001
2016-01-21,65.410004,66.18,64.459999,65.050003,148000,65.050003
2016-01-20,64.279999,66.32,62.77,65.389999,141300,65.389999
2016-01-19,67.720001,67.989998,64.720001,65.379997,178400,65.379997
The GSPC and IXIC data (our first two market factors) are also available on Yahoo! and use the very same format.
The crude oil and treasury bond data is collected from investing.com and has a different format, as shown below (with some samples):
Date Price Open High Low Vol. Change %
Jan 25, 2016 32.17 32.36 32.44 32.10 - -0.59%
Jan 24, 2016 32.37 32.10 32.62 31.99 - 0.54%
Jan 22, 2016 32.19 29.84 32.35 29.53 - 9.01%
Jan 21, 2016 29.53 28.35 30.25 27.87 694.04K 11.22%
Jan 20, 2016 26.55 28.33 28.58 26.19 32.11K -6.71%
Jan 19, 2016 28.46 29.20 30.21 28.21 188.03K -5.21%
In our use case, the factors' data will be used jointly to build a statistical model: as a consequence, we first need to preprocess the data to proceed.
In this Notebook, all data files have been downloaded for you, such that you can focus on pre-processing. Next, we will:
We need two functions to read and parse data from Yahoo! and Investing.com respectively. We are interested only in information about the time and the corresponding returns of a factor or an instrument: as a consequence, we will project away many columns of our RAW data, and keep only the information we are interested in.
The 3000-instrument and the 4-factor history are small enough to be read and processed locally: we do not need to use the power of parallel computing to proceed. Note that this is true also for larger cases with hundreds of thousands of instruments and thousands of factors. The need for a distributed system like Spark comes in when actually running the Monte Carlo simulations, which can require massive amounts of computation on each instrument.

We can parse such a date string into a datetime object by using the function strptime(<string>, <dtime_format>). In this case, the datetime format is "%b %d, %Y". For more information, please follow this link.
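For example, both date formats used in this Notebook can be parsed as follows:

```python
from datetime import datetime

# Investing.com format, e.g. "Jan 21, 2016"
d1 = datetime.strptime("Jan 21, 2016", "%b %d, %Y")
# Yahoo! format, e.g. "2016-01-21"
d2 = datetime.strptime("2016-01-21", "%Y-%m-%d")
print(d1 == d2)  # True
```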
In the next cell, we simply copy data from our HDFS cluster (that contains everything we need for this Notebook) to the instance (a Docker container) running your Notebook. This means that you will have "local" data that you can process without using Spark. Note the folder location: find and verify that you have correctly downloaded the files!
! [ -d monte-carlo-risk ] || (echo "Downloading prepared data from HDFS. Please wait..." ; hdfs dfs -copyToLocal /datasets/monte-carlo-risk . ; echo "Done!";)
from datetime import datetime
from datetime import timedelta
from itertools import islice
%matplotlib inline
import numpy as np
import pandas as pd
# Plotting
import matplotlib.pyplot as plt
from matplotlib import pyplot
from matplotlib.collections import LineCollection
from matplotlib.colors import ListedColormap, BoundaryNorm
import seaborn as sns
# Display
from IPython.display import display, HTML
# Math - ML
import matplotlib.mlab as mlab
from matplotlib.mlab import PCA
import math
# Statistic tool
from statsmodels.tsa.arima_model import ARMA
import statsmodels.api as sm
import statsmodels.stats.diagnostic as tsd
import statsmodels.tsa.stattools as ta
import statsmodels.graphics.tsaplots as taplot
from scipy import stats
import warnings
from statsmodels.nonparametric.kernel_density import KDEMultivariate
from statsmodels.nonparametric.kde import KDEUnivariate
font = {'weight': 'normal','size': 16,}
warnings.filterwarnings('ignore')
base_folder = "monte-carlo-risk/"
factors_folder= base_folder + "factors/"
# read data from local disk
def readInvestingDotComHistory(fname):
    def process_line(line):
        cols = line.split('\t')
        date = datetime.strptime(cols[0], '%b %d, %Y')
        value = float(cols[1])
        return (date, value)
    with open(fname) as f:
        content_w_header = f.readlines()
    # remove the header line
    # and reverse the lines to sort the data by date, in ascending order
    content = content_w_header[-1:0:-1]
    return list(map(process_line, content))
factor1_files = ['crudeoil.tsv', 'us30yeartreasurybonds.tsv']
factor1_files = map(lambda fn: factors_folder + fn, factor1_files)
factors1 = [readInvestingDotComHistory(f) for f in factor1_files]
print(factors1[0][0:5])
print ('Crude Oil data collected in %d years'%(np.round((factors1[0][-1][0]-factors1[0][0][0]).days/365.25)))
print ('Treasury Bonds data collected in %d years'%(np.round(((factors1[1][-1][0]-factors1[1][0][0]).days)/365.25)))
Now, the data structure factors1 is a list containing the data of two (out of a total of four) factors that influence the market, as obtained from investing.com. Each element of the list is itself a list of tuples, each holding a timestamp and the value of one of the two factors discussed above. From now on, we call these tuples "records" or "entries". Visually, factors1 looks like this:
| 0 (crude oil) | 1 (US bonds) |
|---|---|
| time_stamp, value | time_stamp, value |
| ... | ... |
| time_stamp, value | time_stamp, value |
| ... | ... |
# read data from local disk
def readYahooHistory(fname):
    def process_line(line):
        cols = line.split(',')
        date = datetime.strptime(cols[0], '%Y-%m-%d')
        value = float(cols[1])
        return (date, value)
    with open(fname) as f:
        content_w_header = f.readlines()
    # remove the header line
    # and reverse the lines to sort the data by date, in ascending order
    content = content_w_header[-1:0:-1]
    return list(map(process_line, content))
factor2_files = ['GSPC.csv','IXIC.csv']
factor2_files = map(lambda fn: factors_folder + fn, factor2_files)
factors2 = [readYahooHistory(f) for f in factor2_files]
print(factors2[0][:5])
print ('GSPC data collected in %d years'%(np.round((factors2[0][-1][0]-factors2[0][0][0]).days/365.25)))
print ('From:',factors2[0][0][0].strftime('%d/%m/%Y'),'to',factors2[0][-1][0].strftime('%d/%m/%Y'))
Now, the data structure factors2 is again a list, containing the data of the remaining two (out of a total of four) factors that influence the market, as obtained from Yahoo!. Each element of the list is again a list of (timestamp, value) records, one per factor. Visually, factors2 looks like this:
| 0 (GSPC) | 1 (IXIC) |
|---|---|
| time_stamp, value | time_stamp, value |
| ... | ... |
| time_stamp, value | time_stamp, value |
| ... | ... |
Next, we prepare the data for the instruments we consider in this Notebook (i.e., the stocks we want to invest in).
from os import listdir
from os.path import isfile, join
stock_folder = base_folder + 'stocks'
def process_stock_file(fname):
    try:
        return readYahooHistory(fname)
    except Exception:
        # propagate parsing errors (change to `return None` to skip bad files,
        # filtering the Nones out afterwards)
        raise
# select path of all stock data files in "stock_folder"
files = [join(stock_folder, f) for f in listdir(stock_folder) if isfile(join(stock_folder, f))]
# assume that we invest only the first 35 stocks (for faster computation)
files = files[:35]
# read each line in each file, convert it into the format: (date, value)
rawStocks = [process_stock_file(f) for f in files]
# select only instruments which have at least 5 years of history
# Note: the number of business days in a year is 260
number_of_years = 5
rawStocks = list(filter(
    lambda instrument: (instrument[-1][0] - instrument[0][0]).days / 365 >= number_of_years,
    rawStocks))
# For testing, print the first 5 entries of the first stock
print(rawStocks[0][:5])
[(x[0][0],x[-1][0]) for x in rawStocks]
Different types of instruments may trade on different days, or the data may have missing values for other reasons, so it is important to make sure that our different histories align. First, we need to trim all of our time series to the same region in time. Then, we need to fill in missing values. To deal with time series that have missing values at the start and end dates in the time region, we simply fill in those dates with nearby values in the time region.
# note that the crude oil and treasury data is only available starting from 26/01/2006
start = datetime(year=2009, month=1, day=23)
end = datetime(year=2014, month=1, day=23)

def trimToRegion(history, start, end):
    def isInTimeRegion(entry):
        (date, value) = entry
        return date >= start and date <= end
    # only select entries which are in the time region
    trimmed = list(filter(isInTimeRegion, history))
    # if the data does not start/end exactly at the boundaries,
    # duplicate the nearest values onto the boundaries
    if trimmed[0][0] != start:
        trimmed.insert(0, (start, trimmed[0][1]))
    if trimmed[-1][0] != end:
        trimmed.append((end, trimmed[-1][1]))
    return trimmed
# test our function
trimmedStock0 = trimToRegion(rawStocks[0], start, end)
# the first 5 records of stock 0
print(trimmedStock0[:5])
# the last 5 records of stock 0
print(trimmedStock0[-5:])
assert(trimmedStock0[0][0] == start), "the first record must contain the price in the first day of time interval"
assert(trimmedStock0[-1][0] == end), "the last record must contain the price in the last day of time interval"
We expect to have the price of instruments and factors on each business day. Unfortunately, there are many missing values in our data: we lack data for some days, e.g. we have data for the Monday of a certain week, but not for the subsequent Tuesday. So, we need a function that fills in these missing values.
Next, we provide the function to fill in missing values: read it carefully!
def fillInHistory(history, start, end):
    curr = history
    filled = []
    idx = 0
    curDate = start
    numEntries = len(history)
    while curDate < end:
        # if the next entry is on the same day
        # but curDate has already skipped it and moved to the next Monday
        # (only in that case, curr[idx + 1][0] < curDate)
        # then move to the next entry
        while idx + 1 < numEntries and curr[idx + 1][0] == curDate:
            idx += 1
        # only add the last value of the instrument on a single day;
        # check whether curDate is a weekday
        # 0: Monday -> 5: Saturday, 6: Sunday
        if curDate.weekday() < 5:
            filled.append((curDate, curr[idx][1]))
        # move to the next business day
        curDate += timedelta(days=1)
        # skip the weekends
        if curDate.weekday() >= 5:
            # if curDate is Saturday, skip 2 days; otherwise, skip 1 day
            curDate += timedelta(days=(7 - curDate.weekday()))
    return filled
def fillInHistory_fix(history, start, end):
    curr = history
    filled = []
    idx = 0
    curDate = start
    numEntries = len(history)
    while curDate < end:
        while idx + 1 < numEntries and curr[idx + 1][0] == curDate:
            idx += 1
        if curDate.weekday() < 5:
            filled.append((curDate, curr[idx][1]))
            curDate += timedelta(days=1)
        else:
            # advance one day at a time through the weekend,
            # so that weekend entries are not skipped over
            curDate += timedelta(days=1)
    return filled
def threshold_plot(ax, x, y, threshv, color, overcolor):
    """
    Helper function to plot points above a threshold in a different color

    Parameters
    ----------
    ax : Axes
        Axes to plot to
    x, y : array
        The x and y values
    threshv : float
        Plot using overcolor above this value
    color : color
        The color to use for the lower values
    overcolor : color
        The color to use for values over threshv
    """
    # Create a two-color colormap and a norm that maps values
    # below/above the threshold to the two colors
    cmap = ListedColormap([color, overcolor])
    norm = BoundaryNorm([np.min(y), threshv, np.max(y)], cmap.N)
    # Create a set of line segments so that we can color them individually.
    # This creates the points as an N x 1 x 2 array so that we can stack points
    # together easily to get the segments. The segments array for the line
    # collection needs to be numlines x points-per-line x 2 (x and y)
    points = np.array([x, y]).T.reshape(-1, 1, 2)
    segments = np.concatenate([points[:-1], points[1:]], axis=1)
    # Create the line collection object, setting the colormapping parameters.
    # We have to set the actual values used for colormapping separately.
    lc = LineCollection(segments, cmap=cmap, norm=norm)
    lc.set_array(y)
    ax.add_collection(lc)
    ax.set_xlim(np.min(x), np.max(x))
    ax.set_ylim(np.min(y)*1.1, np.max(y)*1.1)
    return lc
# trim into the specific time region
# and fill in the missing values
stocks = list(map(lambda stock: fillInHistory(trimToRegion(stock, start, end), start, end),
                  rawStocks))
# merge the two groups of factors, trim each factor into the time region
# and fill in the missing values
allfactors = factors1 + factors2
factors = list(map(lambda factor: fillInHistory(trimToRegion(factor, start, end), start, end),
                   allfactors))
def buildWindow(seq, k=2):
    """Returns a sliding window (of width k) over data from an iterable:
    s -> (s0, s1, ..., s[k-1]), (s1, s2, ..., sk), ..."""
    it = iter(seq)
    result = tuple(islice(it, k))
    if len(result) == k:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

def calculateReturn(window):
    # return the change in value after two weeks
    return window[-1][1] - window[0][1]

def twoWeekReturns(history):
    # we use 10 instead of 14 to define the window
    # because financial data does not include weekends
    return [calculateReturn(entry) for entry in buildWindow(history, 10)]

stocksReturns = list(map(twoWeekReturns, stocks))
factorsReturns = list(map(twoWeekReturns, factors))
pdf = pd.DataFrame(factors[1])
pdf1 = pd.DataFrame(pd.DataFrame(trimToRegion(allfactors[1], start, end)))
plt.figure(figsize=(20,10))
ind = np.arange(len(factorsReturns[0]))
ax = plt.subplot(2,1,1)
data = np.array(factorsReturns[1])
lc = threshold_plot(ax, ind, data, 0, 'r', 'g')
ax.axhline(0, color='k', ls='--')
lc.set_linewidth(2)
plt.title('Treasury Bond',fontsize=24)
plt.figure(figsize=(20,10))
ax2 = plt.subplot(2,1,2)
ax2.set_ylabel("AFTER FILL MISSING DATA")
plt.plot(pdf[0],pdf[1])
ax3 = plt.subplot(2,1,1)
plt.plot(pdf1[0],pdf1[1])
ax3.set_ylabel("BEFORE FILL MISSING DATA")
plt.show()
# trim into the specific time region
# and fill in the missing values, this time with the fixed function
stocks = list(map(lambda stock: fillInHistory_fix(trimToRegion(stock, start, end), start, end),
                  rawStocks))
# merge the two groups of factors, trim each factor into the time region
# and fill in the missing values, this time with the fixed function
allfactors = factors1 + factors2
factors = list(map(lambda factor: fillInHistory_fix(trimToRegion(factor, start, end), start, end),
                   allfactors))
# test our code
print("the first 5 records of stock 0:", stocks[0][:5], "\n")
print("the last 5 records of stock 0:", stocks[0][-5:], "\n")
print("the first 5 records of factor 0:", factors[0][:5], "\n")
print("the last 5 records of factor 0:", factors[0][-5:], "\n")
pd.DataFrame(list(filter(lambda x: (x[0]>datetime(2013,5,8)) & (x[0]<datetime(2013,5,17)),allfactors[1]))
,columns=['date','value'])
print("11/5/2013 weekday is",datetime(2013,5,11).weekday())
Recall that Value at Risk (VaR) deals with losses over a particular time horizon. We are not concerned with the absolute prices of instruments, but how those prices change over a given period of time. In our project, we will set that length to two weeks: we use the sliding window method to transform time series of prices into an overlapping sequence of price change over two-week intervals.
The figure below illustrates this process. The returns of the market factors over each two-week interval are calculated in the very same way.
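On a toy price history, the sliding-window transformation looks like this (the helpers are redefined here, with a window of width 3 standing in for the 10-business-day windows used in this Notebook, so the snippet is self-contained):

```python
from itertools import islice

def buildWindow(seq, k=2):
    # yield overlapping windows of width k over seq
    it = iter(seq)
    result = tuple(islice(it, k))
    if len(result) == k:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

# toy history: (day, price) pairs
history = [(1, 100.0), (2, 102.0), (3, 101.0), (4, 105.0)]
# the return of each window is last price minus first price
returns = [w[-1][1] - w[0][1] for w in buildWindow(history, 3)]
print(returns)  # [1.0, 3.0]
```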
stocksReturns = list(map(twoWeekReturns, stocks))
factorsReturns = list(map(twoWeekReturns, factors))
# test our functions
print("the first 5 returns of stock 0:", stocksReturns[0][:5])
print("the last 5 returns of stock 0:", stocksReturns[0][-5:])
plt.figure(figsize=(20,30))
plt.subplots_adjust(hspace=0)
ind = np.arange(len(factorsReturns[0]))
ax1 = plt.subplot(4,1,1)
data = np.array(factorsReturns[0])
lc1 = threshold_plot(ax1, ind, data, 0, 'r', 'g')
ax1.axhline(0, color='k', ls='--')
ax1.axvline(640, color='b', ls='--')
ax1.axvline(720, color='b', ls='--')
ax1.tick_params(axis='y', colors='red',labelsize=15)
ax1.set_ylabel('Oil Factor',fontsize=24)
lc1.set_linewidth(2)
ax2 = plt.subplot(4,1,2,sharex=ax1)
data = np.array(factorsReturns[1])
lc2 = threshold_plot(ax2, ind, data, 0, 'r', 'g')
ax2.axhline(0, color='k', ls='--')
ax2.axvline(640, color='b', ls='--')
ax2.axvline(720, color='b', ls='--')
ax2.tick_params(axis='y', colors='red',labelsize=15)
lc2.set_linewidth(2)
ax2.set_ylabel('Treasury Bond',fontsize=24)
ax3 = plt.subplot(4,1,3,sharex=ax1)
data = np.array(factorsReturns[2])
lc3 = threshold_plot(ax3, ind, data, 0, 'r', 'g')
ax3.axhline(0, color='k', ls='--')
ax3.axvline(640, color='b', ls='--')
ax3.axvline(720, color='b', ls='--')
lc3.set_linewidth(2)
ax3.set_ylabel('GSPC factor',fontsize=24)
ax4 = plt.subplot(4,1,4,sharex=ax1)
data = np.array(factorsReturns[3])
lc4 = threshold_plot(ax4, ind, data, 0, 'r', 'g')
ax4.axhline(0, color='k', ls='--')
ax4.axvline(640, color='b', ls='--')
ax4.axvline(720, color='b', ls='--')
lc4.set_linewidth(2)
ax4.set_ylabel('IXIC factor',fontsize=24)
xticklabels = ax1.get_xticklabels() + ax2.get_xticklabels()
plt.setp(xticklabels, visible=False)
plt.show()
Alright! Now we have data that is properly aligned for the training process: stock returns and factor returns, over sliding two-week windows. Next, we will apply the MCS method.
Next, we overview the steps that you have to follow to build a model of your data, and then use Monte Carlo simulations to produce output distributions:
In our simulation, we will use a simple linear model. By our definition of return, a factor return is a change in the value of a market factor over a particular time period, e.g. if the value of the S&P 500 moves from 2000 to 2100 over a time interval, its return would be 100.
A vector that contains the returns of the 4 market factors is called a market factor vector. Generally, instead of using this vector directly as features, we derive a set of features from simple transformations of it. In particular, the vector of 4 values is transformed into a vector of length $m$ by a function $F$. In the simplest case, $F(v) = v$.
Denote $v_t$ the market factor vector, and $f_t$ the transformed features of $v_t$ at time $t$.
$f_{tj}$ is the value of feature $j$ in $f_t$.
Denote $r_{it}$ the return of instrument $i$ at time $t$ and $c_i$ the intercept term of instrument $i$.
We will use a simple linear function to calculate $r_{it}$ from $f_t$:
$$ r_{it} = c_i + \sum_{j=1}^{m}{w_{ij}*f_{tj}} $$where $w_{ij}$ is the weight of feature $j$ for instrument $i$.
All of the above means that, given a market factor vector, we apply featurization and then use the resulting features to calculate the return of the instruments with the linear function above.
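As a quick illustration of the linear model, the return of one instrument is just a dot product plus an intercept. The weights and feature values below are made-up numbers for illustration, not the notebook's fitted parameters:

```python
import numpy as np

# hypothetical fitted parameters for one instrument i:
# intercept c_i followed by the weights w_i1..w_i3
weights = np.array([0.1, 0.5, -0.2, 0.3])

# featurized market factor vector f_t, with a leading 1 so that
# the dot product also picks up the intercept term
features = np.array([1.0, 2.0, 1.5, -1.0])

# r_it = c_i + sum_j w_ij * f_tj, expressed as a single dot product:
# 0.1 + 0.5*2.0 + (-0.2)*1.5 + 0.3*(-1.0) = 0.5
r_it = np.dot(weights, features)
```

This is exactly what the simulation code does later with `np.dot(weights, trialFeatures)`, once a constant 1 has been prepended to the features.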
There are two questions to consider: how do we apply featurization to a factor vector, and how do we pick values for $w_{ij}$?
How do we apply featurization to a factor vector? In fact, the instruments' returns may be non-linear functions of the factor returns, so we should not use the raw factor returns as features in the linear function above. Instead, we transform them into a (possibly larger) set of features. In this Notebook, we include some additional features derived from non-linear transformations of the factor returns: for each factor return, we add two more features, its (signed) square and its (signed) square root. The model remains linear in the sense that the response variable is a linear function of the new features. Note that this particular feature transformation is meant as an illustrative example of some of the options that are available: it should not be considered the state of the art in predictive financial modeling!
How do we pick values for $w_{ij}$?
We transform all the market factor vectors in our historical data into feature vectors. We then have feature vectors for many two-week intervals, together with the corresponding instruments' returns in those intervals. We can use an Ordinary Least Squares (OLS) regression model to estimate, for each instrument, the weights that make our linear function fit the data. The parameters of the OLS function are:
- x: the collection of columns, where each column holds the values of one feature across the two-week intervals
- y: the returns of an instrument in the time intervals corresponding to x
The figure below shows the basic idea of the process used to build a statistical model for predicting the returns of stock X.

def transpose(matrix):
    return np.array(matrix).transpose().tolist()
# test function
assert (transpose([[1,2,3], [4,5,6], [7,8,9]]) == [[1, 4, 7], [2, 5, 8], [3, 6, 9]]), "Function transpose runs incorrectly"
def featurize(factorReturns):
    arr = np.array(factorReturns)
    # signed square of each return
    squaredReturns = (np.power(arr, 2) * np.where(arr > 0, 1, -1)).tolist()
    # signed square root of each return
    squareRootedReturns = (np.sqrt(np.abs(arr)) * np.where(arr > 0, 1, -1)).tolist()
    # concatenate the new features with the original returns
    return squaredReturns + squareRootedReturns + factorReturns
# test our function
assert (featurize([4, -9, 25]) == [16, -81, 625, 2, -3, 5, 4, -9, 25]), "Function runs incorrectly"
def estimateParams(y, x):
    return sm.OLS(y, x).fit().params
# transpose factorsReturns
factorMat = transpose(factorsReturns)
# featurize each row of factorMat
factorFeatures = list(map(featurize,factorMat))
# OLS requires its input to be a numpy array
factor_columns = np.array(factorFeatures)
#add a constant - the intercept term for each instrument i.
factor_columns = sm.add_constant(factor_columns, prepend=True)
# estimate weights
weights = [estimateParams(stockReturns,factor_columns) for stockReturns in stocksReturns]
a = pd.DataFrame(weights, columns=['intercept','squaref1','squaref2','squaref3','squaref4',
'rootf1','rootf2','rootf3','rootf4',
'1-Oil','2-Treasury','3-GSPC','4-IXIC'])
display(a)
Since we cannot define the distributions of the market factors directly, we can only approximate them. A good first step is to plot their values; however, these values may fluctuate quite a lot.
Next, we show how to use the Kernel density estimation (KDE) technique to approximate such distributions. In brief, kernel density estimation is a way of smoothing out a histogram: this is achieved by assigning (or centering) a probability distribution (usually a normal distribution) to each data point, and then summing. So, a set of two-week-return samples would result in a large number of "super-imposed" normal distributions, each with a different mean.
To estimate the probability density at a given point, KDE evaluates the PDFs of all the normal distributions at that point and takes their average. The smoothness of a kernel density plot is governed by its bandwidth, which is the standard deviation of each of the normal distributions. For a brief introduction to KDE, please refer to this link.
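In its simplest form, KDE is just this averaging of per-point normal densities. A minimal hand-rolled sketch (the `kde_by_hand` helper and its toy samples are ours, for intuition only — the notebook itself uses `KDEUnivariate` below):

```python
import numpy as np
from scipy.stats import norm

def kde_by_hand(samples, x, bandwidth):
    # center one normal distribution on each data point
    # (std = bandwidth) and average their densities at x
    return np.mean(norm.pdf(x, loc=samples, scale=bandwidth))

samples = np.array([-1.0, 0.0, 0.2, 1.5])
density_at_0 = kde_by_hand(samples, 0.0, bandwidth=0.5)
```

A larger bandwidth spreads each per-point normal wider, producing a smoother (but potentially over-smoothed) density estimate.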
def plotDistribution(samples):
    vmin = min(samples)
    vmax = max(samples)
    stddev = np.std(samples)
    mu = np.mean(samples)
    domain = np.arange(vmin, vmax, (vmax - vmin) / 100)
    # a simple heuristic to select the bandwidth (Silverman's rule of thumb)
    bandwidth = 1.06 * stddev * pow(len(samples), -.2)
    # estimate density
    kde = KDEUnivariate(samples)
    kde.fit(bw=bandwidth)
    density = kde.evaluate(domain)
    # plot (`density=True` replaces the deprecated `normed=1`)
    plt.hist(samples, density=True, bins=100, color='b', alpha=0.5, label='Real data')
    plt.plot(domain, density, lw=3, color='r', alpha=0.85, label='KDE estimation')
    # stats.norm.pdf replaces the removed mlab.normpdf
    plt.plot(domain, stats.norm.pdf(domain, mu, stddev), 'k--', alpha=0.8, label='normal distribution', lw=4)
    plt.legend()
plt.figure(figsize=(20,15))
plt.subplot(2,2,1)
plt.title('Crude Oil',fontsize=20)
plotDistribution(factorsReturns[0])
plt.subplot(2,2,2)
plt.title('Treasury Bond',fontsize=20)
plotDistribution(factorsReturns[1])
plt.subplot(2,2,3)
plt.title('GSPC',fontsize=20)
plotDistribution(factorsReturns[2])
plt.subplot(2,2,4)
plt.title('IXIC',fontsize=20)
plotDistribution(factorsReturns[3])
plt.show()
For the sake of simplicity, we can say that our smoothed versions of the returns of each factor are represented quite well by a normal distribution. Of course, more exotic distributions, perhaps with fatter tails, could fit the data more closely, but it is outside the scope of this Notebook to proceed in this way.
Now, the simplest way to sample factor returns is to use a normal distribution for each of the factors, and sample from these distributions independently. However, this approach ignores the fact that market factors are often correlated: for example, when the price of crude oil is down, the price of treasury bonds is down too. We can check our data to verify this correlation.

correlation = np.corrcoef(factorsReturns)
correlation
The multivariate normal distribution can help here by taking the correlation information between the factors into account. Each sample from a multivariate normal distribution can be thought of as a vector. Given values for all of the dimensions but one, the distribution of values along that dimension is normal. But, in their joint distribution, the variables are not independent.
For this use case, we can write:
$$ \left(\begin{array}{c}f_{1}\\f_{2}\\f_{3}\\f_{4} \end{array}\right) \sim N \left[ \left( \begin{array}{c} \mu_1\\ \mu_2 \\ \mu_3 \\ \mu_4 \end{array} \right), \left( \begin{array}{cccc} \sigma^2_1 & \rho_{12} \sigma_1\sigma_2 & \rho_{13} \sigma_1\sigma_3 & \rho_{14} \sigma_1\sigma_4 \\ \rho_{12}\sigma_2\sigma_1 & \sigma^2_2 & \rho_{23} \sigma_2\sigma_3 & \rho_{24} \sigma_2\sigma_4\\ \rho_{13} \sigma_3\sigma_1 & \rho_{23} \sigma_3\sigma_2 & \sigma^2_3 & \rho_{34} \sigma_3\sigma_4 \\ \rho_{14} \sigma_4\sigma_1 & \rho_{24} \sigma_4\sigma_2 & \rho_{34} \sigma_3\sigma_4 & \sigma_4^2 \\ \end{array} \right) \right] $$Or,
$$ f_t \sim N(\mu, \Sigma) $$Where $f_1$, $f_2$, $f_3$ and $f_4$ are the market factors, $\sigma_i$ is the standard deviation of factor $i$, $\mu$ is the vector of the empirical means of the returns of the factors and $\Sigma$ is the empirical covariance matrix of the returns of the factors.
The multivariate normal is parameterized with a mean along each dimension and a matrix describing the covariances between each pair of dimensions. When the covariance matrix is diagonal, the multivariate normal reduces to sampling along each dimension independently, but placing non-zero values in the off-diagonals helps capture the relationships between the variables. Once we have the mean of this multivariate normal distribution and its covariance matrix, we can generate sample values for the market factors.
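To see that a non-diagonal covariance matrix indeed induces correlated samples, here is a small self-contained check (the 0.8 correlation and the two-dimensional setup are arbitrary illustrative values, not our factor data):

```python
import numpy as np

rng = np.random.default_rng(0)
mean = [0.0, 0.0]
# off-diagonal 0.8 encodes a strong positive correlation between the two factors
cov = [[1.0, 0.8],
       [0.8, 1.0]]
samples = rng.multivariate_normal(mean, cov, size=100_000)

# the empirical correlation of the samples should be close to 0.8
empirical_corr = np.corrcoef(samples.T)[0, 1]
```

With the off-diagonals set to zero instead, the empirical correlation would be close to 0 — which is exactly the experiment we run near the end of this Notebook.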
Next, we will calculate the mean and the covariance matrix of this multivariate normal distribution from the historical data.
np.cov can help calculate the covariance matrix. The function np.random.multivariate_normal(<mean>, <cov>) is often used to generate samples.
factorCov = np.cov(factorsReturns)
factorMeans = [sum(x)/len(x) for x in factorsReturns]
sample = np.random.multivariate_normal(factorMeans,factorCov)
print(factorCov)
print(factorMeans)
print(sample)
We now define some functions that help us calculate VaR 5%. The functions below are somewhat intricate! This is why we provide a solution for you: however, study them well!
The basic idea of calculating VaR 5% is that we need to find a value such that only 5% of the losses are bigger than it. That means the 5th percentile of the losses should be VaR 5%.
VaR can sometimes be problematic, though, since it does not give any information on the extent of the losses that can exceed the VaR estimate. CVaR is an extension of VaR introduced to deal with this problem: it measures the expected value of the loss in those cases where the VaR estimate has been exceeded.
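For intuition, here is a plain-numpy equivalent of the two statistics on a toy array of simulated returns (note that `np.percentile` interpolates, so on small samples it can differ slightly from the take-the-k-th-worst-loss approach used in the Spark functions below):

```python
import numpy as np

def five_percent_var(returns):
    # VaR 5%: the 5th percentile of the simulated returns --
    # only 5% of the outcomes are worse than this value
    return np.percentile(returns, 5)

def five_percent_cvar(returns):
    # CVaR 5%: the average of the outcomes at or below the VaR estimate
    returns = np.asarray(returns)
    return returns[returns <= five_percent_var(returns)].mean()

toy_returns = [-10.0, -5.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
var = five_percent_var(toy_returns)    # -7.75 with linear interpolation
cvar = five_percent_cvar(toy_returns)  # -10.0: the only return below the VaR
```

CVaR is always at least as pessimistic as VaR, since it averages only the tail beyond the VaR boundary.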
def fivePercentVaR(trials):
    numTrials = trials.count()
    # take the worst 5% of the simulated returns; the boundary value is the VaR
    topLosses = trials.takeOrdered(max(round(numTrials / 20.0), 1))
    return topLosses[-1]

# an extension of VaR
def fivePercentCVaR(trials):
    numTrials = trials.count()
    # CVaR is the average of the worst 5% of the simulated returns
    topLosses = trials.takeOrdered(max(round(numTrials / 20.0), 1))
    return sum(topLosses) / len(topLosses)
def bootstrappedConfidenceInterval(
        trials, computeStatisticFunction,
        numResamples, pValue):
    stats = []
    for i in range(0, numResamples):
        # resample with replacement and recompute the statistic
        resample = trials.sample(True, 1.0)
        stats.append(computeStatisticFunction(resample))
    stats = sorted(stats)
    lowerIndex = int(numResamples * pValue / 2 - 1)
    upperIndex = int(np.ceil(numResamples * (1 - pValue / 2)))
    return (stats[lowerIndex], stats[upperIndex])
Next, we will run the Monte Carlo simulation 100,000 times, in parallel using Spark. Since our cluster has 12 cores (two Spark worker nodes, each with 6 cores), we set parallelism = 12 to dispatch the simulation to these cores, across the two machines (remember, these are not really "physical machines": they are Docker containers running in our infrastructure).

# RUN SIMULATION
def simulateTrialReturns(numTrials, factorMeans, factorCov, weights):
    trialReturns = []
    for i in range(0, numTrials):
        # generate a sample of the factors' returns
        trialFactorReturns = np.random.multivariate_normal(factorMeans, factorCov)
        # featurize the factors' returns
        trialFeatures = featurize(trialFactorReturns.tolist())
        # prepend a constant 1 for the intercept term
        trialFeatures.insert(0, 1)
        # calculate the return of each instrument,
        # then sum them to get the total return for this trial
        trialTotalReturn = np.sum(np.dot(weights, trialFeatures))
        trialReturns.append(trialTotalReturn)
    return trialReturns
parallelism = 12
numTrials = 100000
trial_indexes = list(range(0, parallelism))
seedRDD = sc.parallelize(trial_indexes, parallelism)
bFactorWeights = sc.broadcast(weights)
trials = seedRDD.flatMap(lambda idx: \
simulateTrialReturns(
max(int(numTrials/parallelism), 1),
factorMeans, factorCov,
bFactorWeights.value
))
trials.cache()
valueAtRisk = fivePercentVaR(trials)
conditionalValueAtRisk = fivePercentCVaR(trials)
print("Value at Risk(VaR) 5%:", valueAtRisk)
print("Conditional Value at Risk(CVaR) 5%:", conditionalValueAtRisk)
The value of VaR depends on the number of invested stocks and on the chosen distribution of the random variables. Assume that we get VaR 5% = -2.66: this means that there is a 0.05 probability that the portfolio will fall in value by more than \$2.66 over a two-week period if there is no trading. In other words, the losses are less than \$2.66 over a two-week period with a 95% confidence level. When the loss over two weeks is more than \$2.66, we call it a **failure** (or **exception**). Informally, because of the 5% probability, we expect only $0.05*W$ failures out of $W$ total windows.
In general, the error in a Monte Carlo simulation should be proportional to 1/sqrt(n), where n is the number of trials. This means, for example, that quadrupling the number of trials should approximately cut the error in half. A good way to check the quality of a result is backtesting on historical data. Backtesting is a statistical procedure where actual losses are compared to the estimated VaR. For instance, if the confidence level used to calculate VaR is 95% (or VaR 5%), we expect only 5 failures over 100 two-week time windows.
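The 1/sqrt(n) behaviour can be checked empirically. The sketch below (our own toy experiment, not part of the notebook's pipeline) measures the spread of a Monte Carlo mean estimate at two trial counts:

```python
import numpy as np

rng = np.random.default_rng(42)

def mc_error(num_trials, num_repeats=2000):
    # standard deviation of the Monte Carlo estimate of E[X] for X ~ N(0, 1),
    # measured by repeating the whole estimate num_repeats times
    estimates = [rng.standard_normal(num_trials).mean() for _ in range(num_repeats)]
    return np.std(estimates)

ratio = mc_error(250) / mc_error(1000)  # quadruple the trials
# the ratio should be close to 2: quadrupling the trials roughly halves the error
```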
The most common test of a VaR model is counting the number of VaR failures, i.e., the number of windows in which the losses exceed the VaR estimate. If the number of exceptions is less than the selected confidence level would indicate, the VaR model overestimates the risk; on the contrary, if there are too many exceptions, the risk is underestimated. However, it is very unlikely to observe exactly the number of failures suggested by the confidence level. Therefore, we need a statistical test of whether the number of failures is reasonable, i.e., whether the model should be accepted or rejected.
One common test is Kupiec's proportion-of-failures (POF) test. This test considers how the portfolio performed at many historical time intervals and counts the number of times that the losses exceeded the VaR. The null hypothesis is that the VaR is reasonable, and a sufficiently extreme test statistic means that the VaR estimate does not accurately describe the data. The test statistic is computed as:
$$ -2ln\Bigg(\frac{(1-p)^{T-x}p^x}{(1-\frac{x}{T})^{T-x}(\frac{x}{T})^x}\Bigg) $$where:
$p$ is the quantile-of-loss of the VaR calculation (e.g., in VaR 5%, p=0.05),
$x$ (the number of failures) is the number of historical intervals over which the losses exceeded the VaR
$T$ is the total number of historical intervals considered
Or we can expand out the log for better numerical stability:
$$ \begin{equation} -2\Big((T-x)ln(1-p)+x*ln(p)-(T-x)ln(1-\frac{x}{T})-x*ln(\frac{x}{T})\Big) \end{equation} $$If we assume the null hypothesis that the VaR is reasonable, then this test statistic is drawn from a chi-squared distribution with a single degree of freedom. If the test statistic exceeds the critical value of the chi-squared distribution (equivalently, if its p-value falls below the chosen significance level), we have sufficient evidence to reject the null hypothesis that the model is reasonable: in that case, the model is considered inaccurate.
For example, assume that we calculate VaR 5% (the confidence level of the VaR model is 95%) and get VaR = 2.26. We also observe 40 exceptions over 500 time windows. Using the formula above, the test statistic is equal to 8.08. This is larger than 3.84, the critical value of the chi-squared distribution with one degree of freedom at probability 5%, so the model is rejected. The critical values of the chi-squared distribution can be found by following this link.
However, in this Notebook, it is not convenient to look up the critical value in a "messy" table, especially when we need to change the confidence level. Instead, from the test statistic, we compute the corresponding p-value under the chi-squared distribution using functions from the scipy package. If the computed p-value is smaller than the quantile of loss (e.g., 0.05), the model is rejected, and vice versa.
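Concretely, with scipy, the chi-squared p-value for the test statistic of the example above (8.08, one degree of freedom) can be obtained with the `chi2.sf` survival function:

```python
from scipy import stats

test_statistic = 8.08  # the Kupiec statistic from the example above
p_value = stats.chi2.sf(test_statistic, df=1)
# p_value is about 0.0045, well below 0.05, so the model is rejected
```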

def countFailures(stocksReturns, valueAtRisk):
    failures = 0
    # iterate over time intervals
    for i in range(0, len(stocksReturns[0])):
        # calculate the portfolio loss in each time interval
        loss = np.sum([x[i] for x in stocksReturns])
        # if the loss exceeds the VaR
        if loss < valueAtRisk:
            failures += 1
    return failures
len(stocksReturns)
def kupiecTestStatistic(total, failures, confidenceLevel):
    # note: `confidenceLevel` here is the quantile of loss p (e.g. 0.05 for VaR 5%)
    failureRatio = failures / total
    logNumer = (total - failures) * np.log(1 - confidenceLevel) + failures * np.log(confidenceLevel)
    logDenom = (total - failures) * np.log(1 - failureRatio) + failures * np.log(failureRatio)
    return -2 * (logNumer - logDenom)
# test the function
assert (round(kupiecTestStatistic(250, 36, 0.1), 2) == 4.80), "function kupiecTestStatistic runs incorrectly"
Now we can find the p-value accompanying our test statistic value.
def kupiecTestPValue(stocksReturns, valueAtRisk, confidenceLevel):
    failures = countFailures(stocksReturns, valueAtRisk)
    print("num failures:", failures, 'over', len(stocksReturns[0]))
    # avoid log(0) in the test statistic
    if failures == 0:
        failures = 1
    total = len(stocksReturns[0])
    testStatistic = kupiecTestStatistic(total, failures, confidenceLevel)
    # stats.chisqprob was removed from scipy; chi2.sf is the equivalent survival function
    return stats.chi2.sf(testStatistic, 1)
varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
print("VaR confidence interval: " , varConfidenceInterval)
print("CVaR confidence interval: " , cvarConfidenceInterval)
print("Kupiec test p-value: " , kupiecTestPValue(stocksReturns, valueAtRisk, 0.05))
def ReturnOfAllStock(stocksReturns):
    loss = list()
    for i in range(0, len(stocksReturns[0])):
        loss.append(np.sum([x[i] for x in stocksReturns]))
    return loss
def VisualizeAfterModel(samples, stocksReturns):
    [vmin, vmax, stddev] = [min(samples), max(samples), np.std(samples)]
    domain = np.arange(vmin, vmax, (vmax - vmin) / 100)
    bandwidth = 1.06 * stddev * pow(len(samples), -.2)
    # real loss
    loss = ReturnOfAllStock(stocksReturns)
    # estimated return from the model (`density=True` replaces the deprecated `normed=1`)
    es_return = plt.hist(samples, bins=100, density=True)
    plt.clf()
    # estimated return using KDE
    kde = KDEUnivariate(samples)
    kde.fit(bw=bandwidth)
    density = kde.evaluate(domain)
    # plot
    plt.figure(figsize=(15, 10))
    plt.plot(domain, density, 'b-', linewidth=3, label="estimated return after KDE")
    plt.hist(loss, bins=100, density=True, label="real return", color='g', alpha=0.75)
    plt.plot(list(es_return[1][0:100]), list(es_return[0]), 'r--', linewidth=2, label="estimated return")
    plt.xlabel("Two week return ($)")
    plt.ylabel("Density")
    plt.legend(loc=2)
    plt.show()
VisualizeAfterModel(trials.collect(),stocksReturns)
def Visualize2SampleDistribution(samples1, truesample, dist_name=''):
    # estimated return from the model (`density=True` replaces the deprecated `normed=1`)
    es_return = plt.hist(samples1, bins=100, density=True)
    plt.cla()
    # constrain the axes to the range of the real data
    vmin = min(truesample)
    vmax = max(truesample)
    plt.xlim(vmin, vmax)
    # plot
    if dist_name != '':
        plt.title("Distribution of the real data and its estimation with " + dist_name + " distribution", fontsize=20)
    plt.hist(truesample, bins=100, density=True, label="real data", color='b', alpha=0.6)
    plt.plot(list(es_return[1][0:100]), list(es_return[0]), 'r--', linewidth=4, label="estimation")
    plt.xlabel("Two week return ($)")
    plt.ylabel("Density")
    plt.legend(loc=2)
def VisualALLSTOCK(numTrials, factorMeans, factorCov, stocksReturns, weights, functiona, dist_name, *arg):
    trialReturns = []
    for i in range(0, numTrials):
        # generate a sample of the factors' returns
        trialFactorReturns = functiona(factorMeans, factorCov, *arg)
        # featurize the factors' returns
        trialFeatures = featurize(trialFactorReturns.tolist())
        # prepend a constant 1 for the intercept term
        trialFeatures.insert(0, 1)
        trialReturns.append(np.dot(weights, trialFeatures))
    predictions = pd.DataFrame(trialReturns)
    nCols = 3
    # note the parentheses: divide *before* rounding up
    nRows = int(np.ceil((np.shape(predictions)[1] + 2) / nCols))
    f, axarr = plt.subplots(nRows, nCols)
    f.set_figwidth(20)
    f.set_figheight(30)
    f.suptitle("Distribution of the real data and its estimation with " + dist_name + " distribution", fontsize=20)
    for idx, stock in enumerate(stocksReturns):
        # plot each stock in its own subplot cell
        i, j = divmod(idx, nCols)
        plt.axes(axarr[i, j])
        Visualize2SampleDistribution(predictions[idx], stock)
    plt.show()
VisualALLSTOCK(150000,factorMeans,factorCov,stocksReturns,weights,np.random.multivariate_normal,'normal')

def getRawStockData(size):
    # paths of all stock data files in "stock_folder"
    files = [join(stock_folder, f) for f in listdir(stock_folder) if isfile(join(stock_folder, f))]
    # assume that we invest only in the first `size` stocks (for faster computation)
    files = files[:size]
    # read each line in each file, converting it into the format: (date, value)
    rawStocks = [process_stock_file(f) for f in files]
    # keep only stocks with more than `number_of_years` years of history
    number_of_years = 7
    rawStocks = list(filter(lambda instrument: ((instrument[-1][0] - instrument[0][0]).days / 365) > number_of_years,
                            rawStocks))
    print("the number of stocks we invest in is: ", len(rawStocks))
    return rawStocks
def simulateTrialReturns(numTrials, factorMeans, factorCov, weights, multivariate_func, *arg):
    trialReturns = []
    for i in range(0, numTrials):
        # generate a sample of the factors' returns
        [trialFactorReturns] = multivariate_func(factorMeans, factorCov, 1, *arg)
        # featurize the factors' returns
        trialFeatures = featurize(list(trialFactorReturns))
        # prepend a constant 1 for the intercept term
        trialFeatures.insert(0, 1)
        # calculate the return of each instrument,
        # then sum them to get the total return for this trial
        trialTotalReturn = sum(np.dot(weights, trialFeatures))
        trialReturns.append(trialTotalReturn)
    return trialReturns
# OFFLINE VERSION
def MonteCarloGeneralEstimate(newStocksReturns, factorsReturns, numTrials, multivariate_func, *arg):
    # estimate new weights for the linear regression model
    param = FactorParameter(newStocksReturns, factorsReturns)
    newTrials = simulateTrialReturns(
        numTrials,
        param[0], param[1],
        param[2], multivariate_func, *arg)
    newValueAtRisk = fivePercentVaROff(newTrials)
    print("New Value at Risk(VaR) 5%:", newValueAtRisk)
    return [newTrials, newValueAtRisk]
def FactorParameter(stocksReturns, factorsReturns):
    # covariance and mean of the factors
    factorCov = np.cov(factorsReturns)
    factorMeans = [sum(x) / len(x) for x in factorsReturns]
    # transpose factorsReturns
    factorMat = transpose(factorsReturns)
    # featurize each row of factorMat
    factorFeatures = list(map(featurize, factorMat))
    # OLS requires a numpy array
    factor_columns = np.array(factorFeatures)
    # add a constant - the intercept term for each instrument i
    factor_columns = sm.add_constant(factor_columns, prepend=True)
    # estimate the weights of the linear regression model
    weights = [estimateParams(stockReturns, factor_columns) for stockReturns in stocksReturns]
    return [factorMeans, factorCov, weights]
def MonteCarloGeneralEstimate(newStocksReturns, factorsReturns, numTrials, multivariate_func, *arg):
    # estimate new weights for the linear regression model
    param = FactorParameter(newStocksReturns, factorsReturns)
    # Spark parallelism
    parallelism = 12
    trial_indexes = list(range(0, parallelism))
    seedRDD = sc.parallelize(trial_indexes, parallelism)
    bNewFactorWeights = sc.broadcast(param[2])
    # simulation
    newTrials = seedRDD.flatMap(lambda idx: simulateTrialReturns(
        max(int(numTrials / parallelism), 1),
        param[0], param[1],
        bNewFactorWeights.value, multivariate_func, *arg))
    newTrials.cache()
    newValueAtRisk = fivePercentVaR(newTrials)
    newConditionalValueAtRisk = fivePercentCVaR(newTrials)
    print("New Value at Risk(VaR) 5%:", newValueAtRisk)
    print("New Conditional Value at Risk(CVaR) 5%:", newConditionalValueAtRisk)
    return [newTrials, newValueAtRisk]
def multivariate_normal(factorMeans, factorCov, n, *arg):
    # wrapper with the same signature as the other sampling functions
    return np.random.multivariate_normal(factorMeans, factorCov, n)

def fivePercentVaROff(newTrials):
    # offline (non-Spark) version: sort and take the 5th-percentile loss
    newTrials = sorted(newTrials)
    return newTrials[round(len(newTrials) / 20)]
def HyTestMonteCarlo(trials, stocksReturns, valueAtRisk):
    #varConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentVaR, 100, 0.05)
    #cvarConfidenceInterval = bootstrappedConfidenceInterval(trials, fivePercentCVaR, 100, .05)
    #print("VaR confidence interval: ", varConfidenceInterval)
    #print("CVaR confidence interval: ", cvarConfidenceInterval)
    print("Kupiec test p-value: ", kupiecTestPValue(stocksReturns, valueAtRisk, 0.05))
newrawStocks = getRawStockData(200)
newStocks = list(map(lambda newStock: fillInHistory_fix(trimToRegion(newStock, start, end), start, end),
                     newrawStocks))
newStocksReturns = list(map(twoWeekReturns, newStocks))
[newTrials,newValueAtRisk] = MonteCarloGeneralEstimate(newStocksReturns,factorsReturns,100000,multivariate_normal)
HyTestMonteCarlo(newTrials.collect(),newStocksReturns,newValueAtRisk)
VisualizeAfterModel(newTrials.collect(),newStocksReturns)
print("MAXIMUM return on each stock: ")
np.set_printoptions(precision=3,suppress=True)
print(np.sort(np.max(newStocksReturns,axis=1))[::-1])
print("MINIMUM return on each stock: ")
print(np.sort(np.min(newStocksReturns,axis=1)))
riskylist = np.argsort(np.min(newStocksReturns,axis=1))[0:6]
nRiskyList = len(riskylist)
plt.figure(figsize=(20,30))
for idx, stock in enumerate(riskylist):
    data = newStocksReturns[stock]
    ax = plt.subplot(nRiskyList, 1, idx + 1)
    threshold_plot(ax, np.arange(len(data)), np.array(data), 0, 'r', 'g')
    ax.set_ylabel('stockID=' + str(stock))
def RemoveRiskStock(newStocksReturns, lowerbound, upperbound, threshold):
    # count, for each stock, how many returns fall below the lower bound
    stock_minreturn = [len(np.where(np.asarray(x) < lowerbound)[0]) for x in newStocksReturns]
    newLowRiskStocksReturns = [newStocksReturns[j] for j, i in enumerate(stock_minreturn) if i < threshold]
    # count, for each remaining stock, how many returns exceed the upper bound
    stock_maxreturn = [len(np.where(np.asarray(x) > upperbound)[0]) for x in newLowRiskStocksReturns]
    newLowRiskStocksReturns = [newLowRiskStocksReturns[j] for j, i in enumerate(stock_maxreturn) if i < threshold]
    return newLowRiskStocksReturns
newLowRiskStocksReturns = RemoveRiskStock(newStocksReturns,-80,100,8)
print("the number of stocks, including the risky ones, is", len(newStocksReturns))
print("the number of stocks we invest in now is", len(newLowRiskStocksReturns))
[newTrials,newValueAtRisk] = MonteCarloGeneralEstimate(newLowRiskStocksReturns,factorsReturns,100000,multivariate_normal)
HyTestMonteCarlo(newTrials.collect(),newLowRiskStocksReturns,newValueAtRisk)
VisualizeAfterModel(newTrials.collect(),newLowRiskStocksReturns)
The goal of the Monte Carlo simulation is to simulate correctly the distribution of the stock returns: the VaR is then derived from the simulated distribution. Therefore, if we can test that the simulated distribution matches the distribution of the real data, we gain more confidence in the VaR produced by the estimation model.
In addition, we can use such a test to check whether the distribution of a factor matches the prior distribution that we assumed before plugging it into the regression model.
A comprehensive reference we recommend :) : http://www.physics.csbsju.edu/stats/KS-test.html
plt.figure(figsize=(20,30))
plt.subplot(4,2,1)
plt.title('Crude Oil',fontsize=20)
plotDistribution(factorsReturns[0])
plt.subplot(4,2,2)
stats.probplot(factorsReturns[0],dist='norm',plot=plt)
plt.subplot(4,2,3)
plt.title('Treasury Bond',fontsize=20)
plotDistribution(factorsReturns[1])
plt.subplot(4,2,4)
stats.probplot(factorsReturns[1],dist='norm',plot=plt)
plt.subplot(4,2,5)
plt.title('GSPC',fontsize=20)
plotDistribution(factorsReturns[2])
plt.subplot(4,2,6)
stats.probplot(factorsReturns[2],dist='norm',plot=plt)
plt.subplot(4,2,7)
plt.title('IXIC',fontsize=20)
plotDistribution(factorsReturns[3])
plt.subplot(4,2,8)
stats.probplot(factorsReturns[3],dist='norm',plot=plt)
plt.show()
def KStestNormal(data):
    # two-sample KS test against samples drawn from a normal with the same mean/std
    return stats.ks_2samp(data, stats.norm.rvs(np.mean(data), np.std(data), 100000))
print("TESTING WHETHER 4 FACTORS DISTRIBUTION IS NORMAL OR NOT")
print("Oil: p-value=", KStestNormal(factorsReturns[0])[1])
print("Treasury: p-value=", KStestNormal(factorsReturns[1])[1])
print("GSPC: p-value=", KStestNormal(factorsReturns[2])[1])
print("IXIC: p-value=", KStestNormal(factorsReturns[3])[1])
plt.figure()
VisualizeAfterModel(trials.collect(),stocksReturns)
pp_y = sm.ProbPlot(np.array(ReturnOfAllStock(stocksReturns)),fit=True)
pp_x = sm.ProbPlot(np.array(trials.collect())[0:len(np.array(ReturnOfAllStock(stocksReturns)))],fit=True)
fig = pp_x.qqplot(line='45',other=pp_y)
plt.show()
allStocksReturn = ReturnOfAllStock(stocksReturns)
print("Stocks Return from real data vs from our simulation: p-value=",
stats.ks_2samp(trials.collect(),allStocksReturn)[1])

Now we try to sample the market factors without the correlation information between them, to see how much worse the results become.
def getIndependentCovMatrix(factorCov):
    # keep only the diagonal (the variances), zeroing out the covariances
    independentFactorCov = [[0] * 4 for i in range(0, 4)]
    for i in range(4):
        independentFactorCov[i][i] = factorCov[i][i]
    return independentFactorCov
print(*('[{0:.4f} {1:.4f} {2:.4f} {3:.4f}]'.format(*r) for r in getIndependentCovMatrix(factorCov)), sep='\n')
def MonteCarloRemoveDependentEstimate(newStocksReturns, factorsReturns, numTrials, multivariate_func, *arg):
    # estimate new weights for the linear regression model
    param = FactorParameter(newStocksReturns, factorsReturns)
    parallelism = 12
    trial_indexes = list(range(0, parallelism))
    seedRDD = sc.parallelize(trial_indexes, parallelism)
    bNewFactorWeights = sc.broadcast(param[2])
    # simulation, with the off-diagonal covariances zeroed out
    newTrials = seedRDD.flatMap(lambda idx: simulateTrialReturns(
        max(int(numTrials / parallelism), 1),
        param[0], getIndependentCovMatrix(param[1]),
        bNewFactorWeights.value, multivariate_func, *arg))
    newTrials.cache()
    newValueAtRisk = fivePercentVaR(newTrials)
    newConditionalValueAtRisk = fivePercentCVaR(newTrials)
    print("New Value at Risk(VaR) 5%:", newValueAtRisk)
    print("New Conditional Value at Risk(CVaR) 5%:", newConditionalValueAtRisk)
    return [newTrials, newValueAtRisk]
[newTrials,newValueAtRisk] = MonteCarloRemoveDependentEstimate(stocksReturns, factorsReturns,200000,multivariate_normal)
HyTestMonteCarlo(newTrials.collect(),stocksReturns,newValueAtRisk)
VisualizeAfterModel(newTrials.collect(),stocksReturns)

# all support distribution
dist_names = [ 'alpha', 'beta', 'betaprime', 'burr', 'chi', 'chi2', 'erlang', 'expon', 'exponweib', 'exponpow', 'f',
'fatiguelife', 'fisk', 'genlogistic', 'genexpon', 'genextreme', 'gausshyper', 'gamma', 'gengamma',
'gompertz', 'gumbel_r', 'gumbel_l', 'invgamma', 'invgauss', 'invweibull', 'johnsonsb', 'johnsonsu',
'ksone', 'laplace', 'logistic', 'loggamma', 'loglaplace', 'lognorm', 'lomax', 'maxwell', 'mielke',
'nakagami', 'norm', 'pareto', 'pearson3', 'powerlognorm', 'powernorm', 'reciprocal', 'rayleigh',
'rice', 'recipinvgauss', 't', 'tukeylambda', 'wald', 'weibull_min']
def VisualizeKGoodDistribution(samples, dist_names):
    vmin = min(samples)
    vmax = max(samples)
    stddev = np.std(samples)
    mu = np.mean(samples)
    domain = np.arange(vmin, vmax, (vmax - vmin) / 100)
    # a simple heuristic to select the bandwidth
    bandwidth = 1.06 * stddev * pow(len(samples), -.2)
    # estimate density
    kde = KDEUnivariate(samples)
    kde.fit(bw=bandwidth)
    density = kde.evaluate(domain)
    # plot
    plt.figure(figsize=(20, 10))
    plt.plot(domain, density, 'r--', lw=5, alpha=0.85, label='KDE estimation')
    for dist_name in dist_names:
        # get the distribution object
        dist = getattr(stats, dist_name)
        # find the best-fitting parameters
        param = dist.fit(samples)
        # evaluate the fitted pdf and plot it
        pdf_fitted = dist.pdf(domain, *param[:-2], loc=param[-2], scale=param[-1])
        plt.plot(domain, pdf_fitted, label=dist_name, lw=3)
    plt.legend(loc='upper right')
    plt.show()
# find the best-fitting distributions
def FindGoodDistribution(samples, dist_names, topk):
    pvalueList = list()
    for dist_name in dist_names:
        # get the distribution object
        dist = getattr(stats, dist_name)
        # find the best-fitting parameters
        param = dist.fit(samples)
        # sample from the fitted distribution
        estimateSamples = dist.rvs(*param, size=10000)
        # two-sample KS test between the data and the fitted samples
        pvalueList.append([dist_name, stats.ks_2samp(samples, estimateSamples)[1]])
    # sort by p-value, descending
    pvalueList.sort(key=lambda x: -x[1])
    return pvalueList[:topk]
topPvalueList = FindGoodDistribution(factorsReturns[0],dist_names,10)
print(topPvalueList)
VisualizeKGoodDistribution(factorsReturns[0],[i[0] for i in topPvalueList][:5])
topPvalueList = FindGoodDistribution(factorsReturns[1],dist_names,10)
print(topPvalueList)
VisualizeKGoodDistribution(factorsReturns[1],[i[0] for i in topPvalueList][:5])
topPvalueList = FindGoodDistribution(factorsReturns[2],dist_names,10)
print(topPvalueList)
VisualizeKGoodDistribution(factorsReturns[2],[i[0] for i in topPvalueList][:5])
topPvalueList = FindGoodDistribution(factorsReturns[3],dist_names,10)
print(topPvalueList)
VisualizeKGoodDistribution(factorsReturns[3],[i[0] for i in topPvalueList][:5])

# written by Enzo Michelangeli, style changes by josef-pktd
# Student's t random variable
def multivariate_t_rvs(m, S, n=1, df=np.inf):
    '''generate random variables of multivariate t distribution
    Parameters
    ----------
    m : array_like
        mean of random variable, length determines dimension of random variable
    S : array_like
        square array of covariance matrix
    df : int or float
        degrees of freedom
    n : int
        number of observations, return random array will be (n, len(m))
    Returns
    -------
    rvs : ndarray, (n, len(m))
        each row is an independent draw of a multivariate t distributed
        random variable
    '''
    m = np.asarray(m)
    d = len(m)
    if df == np.inf:
        # use an array so the broadcast below also works in the Gaussian limit
        x = np.ones(n)
    else:
        x = np.random.chisquare(df, n)/df
    z = np.random.multivariate_normal(np.zeros(d), S, (n,))
    return m + z/np.sqrt(x)[:,None]  # same output format as random.multivariate_normal
def FindParameterDistribution(samples, dist_name):
    dist = getattr(stats, dist_name)
    # find parameters
    param = dist.fit(samples)
    return param
bestParameterT = list()
for i in range(0,4):
    bestParameterT.append(FindParameterDistribution(factorsReturns[i],'t'))
df_bpTdistribution = pd.DataFrame(bestParameterT, columns=['df','mean','scale'])
df_bpTdistribution['label'] = ['Oil','Treasury','GSPC','IXIC']
df_bpTdistribution = df_bpTdistribution.set_index('label')
df_bpTdistribution.index.name = None
df_bpTdistribution
[newTrials,newValueAtRisk] = MonteCarloGeneralEstimate(stocksReturns,factorsReturns,200000,multivariate_t_rvs,6)
HyTestMonteCarlo(newTrials,stocksReturns,newValueAtRisk)
VisualizeAfterModel(newTrials.collect(),stocksReturns)
def VisualALLSTOCK(numTrials, stocksReturns, factorsReturns, sampling_func, *arg):
    trialReturns = []
    param = FactorParameter(stocksReturns, factorsReturns)
    for i in range(0, numTrials):
        # generate a sample of the factors' returns
        trialFactorReturns = sampling_func(param[0], param[1], 1, *arg)[0]
        # featurize the factors' returns
        trialFeatures = featurize(trialFactorReturns.tolist())
        # insert weight for intercept term
        trialFeatures.insert(0, 1)
        trialReturns.append(np.dot(param[2], trialFeatures))
    predictions = pd.DataFrame(trialReturns)
    nCols = 3
    nRows = int(np.ceil((np.shape(predictions)[1]+2)/nCols))
    f, axarr = plt.subplots(nRows, nCols)
    f.suptitle("Variations distributions of all stocks return", fontsize=24)
    f.set_figwidth(20)
    f.set_figheight(30)
    for idx, stock in enumerate(stocksReturns):
        # plot
        i, j = divmod(idx, nCols)
        plt.axes(axarr[i, j])
        Visualize2SampleDistribution(predictions[idx], stock)
    plt.show()
VisualALLSTOCK(150000,stocksReturns,factorsReturns,multivariate_t_rvs,6)
df = np.arange(3,9,0.5)
nCols = 3
nRows = int(np.ceil(len(df)/nCols))
f, axarr = plt.subplots(nRows, nCols)
f.suptitle("Variations distributions of all stocks return", fontsize=20)
f.set_figwidth(20)
f.set_figheight(15)
for idx, dfvalue in enumerate(df):
    # simulation
    [newTrials,newValueAtRisk] = MonteCarloGeneralEstimate(stocksReturns,factorsReturns,150000,multivariate_t_rvs,dfvalue)
    # test
    print("df=", dfvalue, "Kupiec test p-value:", kupiecTestPValue(stocksReturns, newValueAtRisk, 0.05))
    # plot
    i, j = divmod(idx, nCols)
    plt.axes(axarr[i, j])
    Visualize2SampleDistribution(newTrials.collect(), ReturnOfAllStock(stocksReturns))
    plt.title('df='+str(dfvalue), fontsize=16)
f.tight_layout()
plt.show()
where $\Phi$ is the cumulative distribution function of the standard normal distribution.
def sampleJohnsonsu(a, b, mu, sigma, n):
    return sigma*np.sinh((stats.norm.ppf(np.random.uniform(size=n))-a)/b)+mu
# RUN SIMULATION
def multivariate_johnsonsu_rvs(mean, S, n, a, b, mu, sigma):
    d = len(mu)
    z = np.random.multivariate_normal(np.zeros(d), S, (n,))
    z = z/np.array([np.sqrt(S[i][i]) for i in range(len(S))])
    return (sigma*np.sinh((z-a)/b)+mu).tolist()
def FindParameterForDistribution(factorsReturns, dist_name):
    param = []
    for i in range(0, len(factorsReturns)):
        param.append(FindParameterDistribution(factorsReturns[i], dist_name))
    return np.array(param).T
param = FindParameterDistribution(factorsReturns[0],'johnsonsu')
# from our conversion
plt.hist(sampleJohnsonsu(*param,100000), bins=100, normed=1)
# from the scipy.stats library
y, binEdges = np.histogram(stats.johnsonsu.rvs(*param, size=100000), bins=100, normed=1)
bincenters = 0.5*(binEdges[1:]+binEdges[:-1])
plt.plot(bincenters, y, 'r-', lw=3)
plt.show()
def MonteCarloGeneralEstimate(newStocksReturns, factorsReturns, numTrials, multivariate_func, *arg):
    # estimate new weights for the linear regression model
    param = FactorParameter(newStocksReturns, factorsReturns)
    # run in parallel
    parallelism = 12
    trial_indexes = list(range(0, parallelism))
    seedRDD = sc.parallelize(trial_indexes, parallelism)
    bNewFactorWeights = sc.broadcast(param[2])
    newTrials = seedRDD.flatMap(lambda idx: simulateTrialReturns(
        max(int(numTrials / parallelism), 1),
        param[0], param[1],
        bNewFactorWeights.value, multivariate_func, *arg))
    newTrials.cache()
    newValueAtRisk = fivePercentVaR(newTrials)
    newConditionalValueAtRisk = fivePercentCVaR(newTrials)
    print("New Value at Risk (VaR) 5%:", newValueAtRisk)
    print("New Conditional Value at Risk (CVaR) 5%:", newConditionalValueAtRisk)
    return [newTrials, newValueAtRisk]
# Optimise parameters
arg = FindParameterForDistribution(factorsReturns,'johnsonsu')
# Sample
sample = multivariate_johnsonsu_rvs(factorMeans,factorCov,200000,*arg)
sample = pd.DataFrame(sample)
titleList = ['Crude Oil','Treasury','GSPC','IXIC']
# plot
nCols = 2
nRow = 2
f , axarr = plt.subplots(2, 2)
f.suptitle("Distribution of the real data and its estimation with johnson-SU distribution",fontsize = 20)
f.set_figwidth(20)
f.set_figheight(15)
for idx, stock in enumerate(factorsReturns):
    # plot
    i, j = divmod(idx, nCols)
    plt.axes(axarr[i, j])
    Visualize2SampleDistribution(sample[idx], factorsReturns[idx])
    plt.title(titleList[idx], fontsize=20)
plt.show()
[newTrials,newValueAtRisk] = MonteCarloGeneralEstimate(stocksReturns,factorsReturns,
200000,multivariate_johnsonsu_rvs,*arg)
HyTestMonteCarlo(newTrials,stocksReturns,newValueAtRisk)
VisualizeAfterModel(newTrials.collect(),stocksReturns)
The Kolmogorov–Smirnov test is an option, but it is more sensitive to deviations in the center of the distribution than in the tails. The Anderson–Darling test statistic puts more weight on the tails than the KS test.
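A minimal sketch of that difference (assuming only NumPy and SciPy; the variable names are illustrative): when a normal distribution is fitted to heavy-tailed Student-t samples, the Anderson–Darling statistic grows much more sharply than the KS statistic, because the mismatch lives in the tails.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
normal_data = rng.standard_normal(2000)
heavy_tail_data = stats.t.rvs(df=3, size=2000, random_state=rng)

# KS statistic against a fitted normal (weights the center of the distribution)
ks_norm = stats.kstest(normal_data, 'norm',
                       args=(normal_data.mean(), normal_data.std())).statistic
ks_heavy = stats.kstest(heavy_tail_data, 'norm',
                        args=(heavy_tail_data.mean(), heavy_tail_data.std())).statistic

# Anderson-Darling statistic against a normal (weights the tails)
ad_norm = stats.anderson(normal_data, dist='norm').statistic
ad_heavy = stats.anderson(heavy_tail_data, dist='norm').statistic

# both statistics flag the heavy-tailed sample, but AD reacts far more strongly
print("KS:", ks_norm, ks_heavy)
print("AD:", ad_norm, ad_heavy)
```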
If a random variable $X$ is log-normally distributed, then $Y = \ln(X)$ has a normal distribution; conversely, if $Y$ has a normal distribution, then $e^{Y}$ is log-normally distributed. The main characteristic of the log-normal distribution is that it is skewed to the right with a long tail, which is why it is often used in finance. To implement the multivariate version we follow the formula
$X=e^{\mu +\sigma Z}$
with $Z$ a standard normal variable. Here $\mu$ and $\sigma$ are the parameters of the underlying normal distribution, which we can estimate from the log of our factor data. Because the log-normal distribution has a lower bound of zero, we have to shift the data to the right before fitting and shift it back after sampling.
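As a minimal univariate sketch of this shift trick (names are illustrative, assuming only NumPy): shift the data above zero, fit a normal to the logs, sample via $e^{\mu+\sigma Z}$, and shift back.

```python
import numpy as np

rng = np.random.RandomState(42)
# toy factor returns that can be negative
factor = rng.normal(loc=0.0, scale=0.02, size=5000)

# shift so every value is strictly positive (min - 1, as in the notebook)
shift = np.min(factor) - 1
log_data = np.log(factor - shift)

# fit the underlying normal: mu and sigma of log(X - shift)
mu, sigma = np.mean(log_data), np.std(log_data)

# sample: X = exp(mu + sigma * Z) + shift, with Z standard normal
z = rng.standard_normal(100000)
samples = np.exp(mu + sigma * z) + shift

# the simulated mean should be close to the empirical mean
print(np.mean(factor), np.mean(samples))
```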
def multivariate_lognorm_rvs(mean, var, n, posi=0, *arg):
    z = np.random.multivariate_normal(mean, var, (n,))
    return (np.exp(z)+np.array(posi)).tolist()
# RUN SIMULATION
def multivariate_lognorm_rvs_v2(mean, var, n, posi, lmean, lvar):
    z = np.random.multivariate_normal(np.zeros(len(mean)), var, (n,))[0]
    z = z/np.array([np.sqrt(var[i][i]) for i in range(len(var))])
    return [(np.exp(np.array([np.sqrt(lvar[i][i]) for i in range(len(lvar))])
             *z+np.array(lmean))+np.array(posi)).tolist()]
## convert data to positive values
# - lowest value in each factor
posi = list()
# - log values of the factors
logFactorsReturn = list()
# - positive values of the factors
posvalue = list()
for factor in factorsReturns:
    posi.append(np.min(factor)-1)
    posvalue.append(np.array(factor)-posi[-1])
    logFactorsReturn.append(np.log(np.array(factor)-posi[-1]))
# - mean and variance of log-norm distribution
logFactorsMean = [np.mean(i) for i in logFactorsReturn]
logFactorsVar = np.cov(logFactorsReturn)
factorsMean = [np.mean(i) for i in posvalue]
factorsVar = np.cov(posvalue)
# - sampling 3e5 samples
trials = list()
for i in range(int(3e5)):
    [trial] = multivariate_lognorm_rvs(logFactorsMean,logFactorsVar,1,posi)
    trials.append(trial)
trials = np.transpose(trials)
plt.figure(figsize=(20,6))
ax = plt.subplot(1,2,1)
Visualize2SampleDistribution(trials[0],factorsReturns[0])
ax.set_title('Crude Oil',fontsize=20)
ax = plt.subplot(1,2,2)
Visualize2SampleDistribution(trials[1],factorsReturns[1])
ax.set_title('Treasury',fontsize=20)
plt.suptitle('',fontsize =20)
trials = list()
for i in range(int(3e5)):
    [trial] = multivariate_lognorm_rvs_v2(factorsMean,factorsVar,1,posi,logFactorsMean,logFactorsVar)
    trials.append(trial)
trials = np.transpose(trials)
plt.figure(figsize=(20,6))
ax = plt.subplot(1,2,1)
Visualize2SampleDistribution(trials[0],factorsReturns[0])
ax.set_title('Crude Oil',fontsize=20)
ax = plt.subplot(1,2,2)
Visualize2SampleDistribution(trials[1],factorsReturns[1])
ax.set_title('Treasury',fontsize=20)
trials = list()
for i in range(int(1e5)):
    [trial] = multivariate_normal(logFactorsMean,logFactorsVar,1)
    trials.append(trial)
trials = np.transpose(trials)
plt.figure(figsize=(20,6))
ax = plt.subplot(1,2,1)
Visualize2SampleDistribution(trials[2],logFactorsReturn[2])
ax.set_title('GSPC',fontsize=20)
ax = plt.subplot(1,2,2)
Visualize2SampleDistribution(trials[3],logFactorsReturn[3])
ax.set_title('IXIC',fontsize=20)
plt.suptitle('',fontsize =20)
param = [posi,logFactorsMean,logFactorsVar]
[newTrials,newValueAtRisk] = MonteCarloGeneralEstimate(stocksReturns,factorsReturns,
200000,multivariate_lognorm_rvs_v2,*param)
HyTestMonteCarlo(newTrials,stocksReturns,newValueAtRisk)
VisualizeAfterModel(newTrials.collect(),stocksReturns)
pp_y = sm.ProbPlot(np.array(ReturnOfAllStock(stocksReturns)),fit=True)
pp_x = sm.ProbPlot(np.array(newTrials.collect())[0:len(np.array(ReturnOfAllStock(stocksReturns)))],fit=True)
fig = pp_x.qqplot(line='45',other=pp_y)
plt.show()
allStocksReturn = ReturnOfAllStock(stocksReturns)
print("Stocks Return from real data vs from our simulation: p-value=",
stats.ks_2samp(newTrials.collect(),allStocksReturn)[1])
# Sample
sample = multivariate_t_rvs(factorMeans,factorCov,200000,5.5)
sample = pd.DataFrame(sample)
titleList = ['Crude Oil','Treasury','GSPC','IXIC']
nCols = 2
nRow = 2
f , axarr = plt.subplots(2, 2)
f.suptitle("Distribution of the real data and its estimation with t-distribution",fontsize = 20)
f.set_figwidth(20)
f.set_figheight(15)
for idx, stock in enumerate(factorsReturns):
    i, j = divmod(idx, nCols)
    plt.axes(axarr[i, j])
    Visualize2SampleDistribution(sample[idx], factorsReturns[idx])
    plt.title(titleList[idx], fontsize=20)
plt.show()
PCA_model = PCA(np.asarray(factorsReturns).T)
# project(x, minfrac=0.0) : project new data into new dimension
factorsReturn_PCA = PCA_model.Y.T
correlation = np.corrcoef(factorsReturn_PCA)
np.set_printoptions(precision=2)
print(correlation)
plt.figure(figsize=(20,15))
plt.subplot(2,2,1)
plt.title('Crude Oil in PCA',fontsize=20)
plotDistribution(factorsReturn_PCA[0])
plt.subplot(2,2,2)
plt.title('Treasury Bond in PCA',fontsize=20)
plotDistribution(factorsReturn_PCA[1])
plt.subplot(2,2,3)
plt.title('GSPC in PCA',fontsize=20)
plotDistribution(factorsReturn_PCA[2])
plt.subplot(2,2,4)
plt.title('IXIC in PCA',fontsize=20)
plotDistribution(factorsReturn_PCA[3])
plt.show()
newrawStocks = getRawStockData(35)
newStocks = list(map(lambda newStock: fillInHistory_fix(trimToRegion(newStock, start, end)
,start, end), newrawStocks))
newStocksReturns = list(map(twoWeekReturns, newStocks))
[newTrials,newValueAtRisk] = MonteCarloGeneralEstimate(newStocksReturns,factorsReturn_PCA,200000,multivariate_normal)
print("Kupiec test p-value: " , kupiecTestPValue(newStocksReturns, newValueAtRisk, 0.05))
VisualizeAfterModel(newTrials.collect(),newStocksReturns) #.collect()
In this part, we try to sample each variable independently, to check whether it works as well as the multivariate sampling used above.
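The pitfall to watch for: sampling each factor independently discards the correlation between factors, which changes the variance of any weighted sum of them. A small sketch (illustrative names, assuming only NumPy):

```python
import numpy as np

rng = np.random.RandomState(0)
mean = np.zeros(2)
# two strongly correlated factors
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])

# multivariate sampling keeps the correlation
joint = rng.multivariate_normal(mean, cov, size=200000)
# independent sampling uses only the marginal variances
indep = np.column_stack([rng.normal(0, np.sqrt(cov[i, i]), 200000)
                         for i in range(2)])

# variance of the equally-weighted sum: 1 + 1 + 2*0.8 = 3.6 vs 1 + 1 = 2.0
var_joint = np.var(joint.sum(axis=1))
var_indep = np.var(indep.sum(axis=1))
print(var_joint, var_indep)
```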
def multipleSamplingParameter(factorsReturns, *distlist):
    samplingParam = list()
    for idx in range(len(factorsReturns)):
        param = FindParameterDistribution(factorsReturns[idx], distlist[idx])
        samplingParam.append(param)
    return samplingParam
def samplingIndependentGeneral(numTrials, distName, *distParam):
    dist = getattr(stats, distName)
    return dist.rvs(*distParam, size=1)[0]
def simulateIndependentTrialReturns(numTrials, weights, numFactors, *distlist):
    trialReturns = []
    for i in range(0, numTrials):
        # generate a sample of the factors' returns, one factor at a time
        trialFactorReturns = list()
        for idx in np.arange(numFactors):
            trialFactorReturns.append(samplingIndependentGeneral(1, distlist[idx], *distlist[numFactors+idx]).tolist())
        # featurize the factors' returns
        trialFeatures = featurize(list(trialFactorReturns))
        # insert weight for intercept term
        trialFeatures.insert(0, 1)
        # calculate the return of each instrument
        trialTotalReturn = np.dot(weights, trialFeatures)
        # then calculate the total return for this trial's features
        trialTotalReturn = sum(trialTotalReturn)
        trialReturns.append(trialTotalReturn)
    return trialReturns
def MonteCarloIndependentEstimate(newStocksReturns, factorsReturns, numTrials, *distlist):
    # estimate new weights for the linear regression model
    param = FactorParameter(newStocksReturns, factorsReturns)
    # find the distribution parameters for the input factors
    distParam = multipleSamplingParameter(factorsReturns, *distlist)
    # run in parallel
    parallelism = 12
    trial_indexes = list(range(0, parallelism))
    seedRDD = sc.parallelize(trial_indexes, parallelism)
    bNewFactorWeights = sc.broadcast(param[2])
    newTrials = seedRDD.flatMap(lambda idx: simulateIndependentTrialReturns(
        max(int(numTrials / parallelism), 1),
        bNewFactorWeights.value, len(factorsReturns), *distlist, *distParam))
    newTrials.cache()
    newValueAtRisk = fivePercentVaR(newTrials)
    newConditionalValueAtRisk = fivePercentCVaR(newTrials)
    print("New Value at Risk (VaR) 5%:", newValueAtRisk)
    print("New Conditional Value at Risk (CVaR) 5%:", newConditionalValueAtRisk)
    return [newTrials, newValueAtRisk]
distList = ['norm','norm','norm','norm']
[newTrials,newValueAtRisk] = MonteCarloIndependentEstimate(newStocksReturns,factorsReturn_PCA,300000,*distList)
HyTestMonteCarlo(newTrials,newStocksReturns,newValueAtRisk)
VisualizeAfterModel(newTrials.collect(),newStocksReturns)

bestParameterT = list()
for i in range(0,4):
    bestParameterT.append(FindParameterDistribution(factorsReturn_PCA[i],'t'))
df_bpTdistribution = pd.DataFrame(bestParameterT, columns=['df','mean','scale'])
df_bpTdistribution['label'] = ['Oil_PCA','Treasury_PCA','GSPC_PCA','IXIC_PCA']
df_bpTdistribution = df_bpTdistribution.set_index('label')
df_bpTdistribution.index.name = None
df_bpTdistribution
distList = ['t','t','t','t']
[newTrials,newValueAtRisk] = MonteCarloIndependentEstimate(newStocksReturns,factorsReturn_PCA,100000,*distList)
print("Kupiec test p-value: " , kupiecTestPValue(newStocksReturns, newValueAtRisk, 0.05))
VisualizeAfterModel(newTrials.collect(),newStocksReturns)
topPvalueList = FindGoodDistribution(factorsReturn_PCA[0],dist_names,10)
print(topPvalueList)
VisualizeKGoodDistribution(factorsReturn_PCA[0],[i[0] for i in topPvalueList][:5])
topPvalueList = FindGoodDistribution(factorsReturn_PCA[1],dist_names,10)
print(topPvalueList)
VisualizeKGoodDistribution(factorsReturn_PCA[1],[i[0] for i in topPvalueList][:5])
topPvalueList = FindGoodDistribution(factorsReturn_PCA[2],dist_names,10)
print(topPvalueList)
VisualizeKGoodDistribution(factorsReturn_PCA[2],[i[0] for i in topPvalueList][:5])
topPvalueList = FindGoodDistribution(factorsReturn_PCA[3],dist_names,10)
print(topPvalueList)
VisualizeKGoodDistribution(factorsReturn_PCA[3],[i[0] for i in topPvalueList][:5])
distList = ['gengamma','genlogistic','t','logistic']
[newTrials,newValueAtRisk] = MonteCarloIndependentEstimate(newStocksReturns,factorsReturn_PCA,100000,*distList)
print("Kupiec test p-value: " , kupiecTestPValue(newStocksReturns, newValueAtRisk, 0.05))
VisualizeAfterModel(newTrials.collect(),newStocksReturns)
def FindParamRegression(factorsReturns, stocksReturns):
    # transpose factorsReturns
    factorMat = transpose(factorsReturns)
    # featurize each row of factorMat
    factorFeatures = list(map(featurize, factorMat))
    # OLS requires the parameters as a numpy array
    factor_columns = np.array(factorFeatures)
    # add a constant - the intercept term for each instrument i
    factor_columns = sm.add_constant(factor_columns, prepend=True)
    # estimate new weights for the linear regression model
    weights = [estimateParams(stockReturns, factor_columns) for stockReturns in stocksReturns]
    return [factor_columns, weights]
def CompareMSEbyFeature(stocksReturns, feature1, feature2):
    # get predictions with the first feature set
    [factor_columns, weights] = FindParamRegression(feature1[1], stocksReturns)
    weights = np.array(weights).T
    prediction = np.dot(factor_columns, weights).T
    # get predictions with the second feature set
    [factor_columns, weights] = FindParamRegression(feature2[1], stocksReturns)
    weights = np.array(weights).T
    prediction_2 = np.dot(factor_columns, weights).T
    # compute MSE
    MSE = np.sum((prediction-np.array(stocksReturns))**2, axis=1)/len(stocksReturns[0])
    MSE_n = np.sum((prediction_2-np.array(stocksReturns))**2, axis=1)/len(stocksReturns[0])
    x = np.arange(1, len(MSE)+1)
    # plot data
    plt.figure(figsize=(20,6))
    plt.plot(x, -MSE_n+MSE, 'b-o', linewidth=1, label="Amount of MSE reduced compared with "+feature1[0], alpha=0.75)
    plt.bar(x, MSE, 0.8, color='c', label="MSE-"+feature1[0], alpha=1, align='center')
    plt.bar(x, MSE_n, 0.8, color='r', label="MSE-"+feature2[0], alpha=0.5, align='center')
    plt.xlabel("Stock ID")
    plt.ylabel("Mean Square Error")
    plt.legend(loc=0, fontsize=15)
    plt.show()
CompareMSEbyFeature(stocksReturns,("Origin",factorsReturns),("PCA",factorsReturn_PCA))
def calculateReturnPercentage(window):
    # return the relative change of value after two weeks
    return (window[-1][1] - window[0][1])/float(window[0][1])
def twoWeekReturns(history, ReturnFunction):
    # we use 10 instead of 14 to define the window
    # because financial data does not include weekends
    return [ReturnFunction(entry) for entry in buildWindow(history, 10)]
stocksReturnsPercent= list(map(lambda x:twoWeekReturns(x,calculateReturnPercentage), stocks))
factorsReturnsPercent = list(map(lambda x:twoWeekReturns(x,calculateReturnPercentage), factors))
[newTrials,newValueAtRisk] = MonteCarloGeneralEstimate(stocksReturnsPercent,factorsReturnsPercent,200000,multivariate_normal)
HyTestMonteCarlo(newTrials,stocksReturnsPercent,newValueAtRisk)
[newTrials,newValueAtRisk] = MonteCarloGeneralEstimate(stocksReturnsPercent,factorsReturnsPercent,200000,multivariate_t_rvs,4)
HyTestMonteCarlo(newTrials,stocksReturnsPercent,newValueAtRisk)
CompareMSEbyFeature(stocksReturns,("Origin",factorsReturns),("Percentage",factorsReturnsPercent))
plt.figure(figsize=(20,10))
ind = np.arange(len(stocksReturns[0]))
ax1 = plt.subplot(2,1,1)
data = np.array(stocksReturns[0])
lc1 = threshold_plot(ax1, ind, data, 0, 'r', 'g')
ax1.axhline(0, color='k', ls='--')
ax1.tick_params(axis='y', colors='red',labelsize=15)
ax1.set_ylabel('First Stock Return',fontsize=24)
ax2 = plt.subplot(2,1,2)
plotDistribution(stocksReturns[0])
ax2.set_ylabel('First Stock Distribution',fontsize=24)
plt.show()
fig = plt.figure(figsize=(20,12))
ax1 = fig.add_subplot(2,1,1)
fig = taplot.plot_acf(stocksReturns[0],lags=100,ax=ax1,alpha=0.01)
ax1.set_xticks(np.arange(0, 105, 5))
ax2 = fig.add_subplot(2,1,2)
fig = taplot.plot_pacf(stocksReturns[0],lags=100,ax=ax2,alpha=0.01)
ax2.set_xticks(np.arange(0, 105, 5))
plt.show()
model_fit = sm.tsa.ARIMA(stocksReturns[0], (1,0,8)).fit(disp=1)
modelSummary = model_fit.summary()
print(modelSummary.tables[0],'\n',modelSummary.tables[1])
def ARDataExtract(stocksReturns, factorsReturns, lag):
    stockReturn1lag = list()
    for i in range(lag, 0, -1):
        stockReturn1lag.append([stock[lag-i:-i] for stock in stocksReturns])
    newStockReturn = [stock[lag:] for stock in stocksReturns]
    newFactor = [factor[lag:] for factor in factorsReturns]
    return [newStockReturn, newFactor, stockReturn1lag]
def FindParamAutoRegression(factorsReturns, stockReturns, stockReturnKlag, flag=0):
    # transpose factorsReturns
    factorMat = transpose(factorsReturns)
    # featurize each row of factorMat
    factorFeatures = list(map(featurize, factorMat))
    # OLS requires the parameters as a numpy array
    factor_columns = np.array(factorFeatures)
    # add a constant - the intercept term for each instrument i
    factor_columns = sm.add_constant(factor_columns, prepend=True)
    # estimate new weights for the linear regression model
    new_factor_columns = list()
    weights = list()
    for ind, stock in enumerate(stockReturns):
        if (flag == 0):
            # AR case
            extra_factor = np.asarray(GetKLagFactor(stockReturnKlag, ind)).T
        else:
            # MA case
            extra_factor = np.asarray(stockReturnKlag[ind])[:, None]
        # add to factor columns
        new_factor_column = np.append(factor_columns, extra_factor, axis=1)
        new_factor_columns.append(new_factor_column)
        # fit with regression model
        weights.append(estimateParams(stock, new_factor_column))
    return [np.asanyarray(new_factor_columns), weights]
#get prediction
[factor_columns,weights] = FindParamRegression(factorsReturns,stocksReturns)
weights = np.array(weights).T
prediction = np.dot(factor_columns,weights).T
# get prediction new
[newStockReturn,newFactor,stockReturn1lag] = ARDataExtract(stocksReturns,factorsReturns,1)
[factor_columns,weights] = FindParamAutoRegression(newFactor,newStockReturn,stockReturn1lag)
weights = np.array(weights)
# predict for each stock
prediction_n = list()
for ind, stock1lag in enumerate(stockReturn1lag[0]):
    # predict using each stock's own factor columns and weights
    prediction_n.append(np.dot(factor_columns[ind], weights[ind]).T.tolist())
# compute MSE
MSE = np.sum((prediction-np.array(stocksReturns))**2,axis=1)/len(stocksReturns[0])
MSE_n = np.sum((prediction_n-np.array(newStockReturn))**2,axis=1)/len(newStockReturn[0])
x = np.arange(1,len(MSE)+1)
# plot data
#sns.set_context("notebook", font_scale=1.2)
plt.figure(figsize=(15,6))
plt.plot(x,-MSE_n+MSE,'b-o',linewidth=1, label="Difference 2 MSE",alpha=0.75)
plt.bar(x,MSE,0.8, color='c',label="MSE - Origin", alpha = 1,align='center')
plt.bar(x,MSE_n,0.8,color='r', label="MSE - AR 1 order ",alpha=0.9,align='center')
plt.xlim(0,len(weights)+1)
plt.xlabel("Stock ID")
plt.ylabel("Mean Square Error")
plt.legend(loc=1)
plt.show()
def GetKLagFactor(stockReturnLag, stockIdx):
    re = list()
    for stock in stockReturnLag:
        re.append(stock[stockIdx])
    return re
def ARPrediction(weights, stockReturnKlag, new_factor_columns):
    prediction_n = list()
    for ind, weight in enumerate(weights):
        # predict with the fitted regression model
        prediction_n.append(np.dot(new_factor_columns[ind], weight).T.tolist())
    return prediction_n
np.set_printoptions(precision=2)
num_plots = 11
plt.figure(figsize=(15,6))
colormap = plt.cm.gist_ncar
plt.gca().set_color_cycle([colormap(i) for i in np.linspace(0, 0.9, num_plots)])
x = np.arange(1,len(weights)+1)
# get prediction new
for lag in range(1, num_plots+1):
    # get data
    [newStockReturn,newFactor,stockReturnKlag] = ARDataExtract(stocksReturns,factorsReturns,lag)
    # find parameters
    [factor_columns,weights] = FindParamAutoRegression(newFactor,newStockReturn,stockReturnKlag)
    weights = np.array(weights)
    # prediction
    prediction = ARPrediction(weights,stockReturnKlag,factor_columns)
    MSE_n = np.sum((prediction-np.array(newStockReturn))**2,axis=1)/len(newStockReturn[0])
    # plot
    plt.plot(x, MSE_n, '--o', linewidth=1, label="order = {i}".format(i=lag), alpha=0.75)
plt.xlim(0,len(weights)+2)
plt.xlabel("Stock ID")
plt.ylabel("Mean Square Error")
plt.title("MSE each stock in 10 orders",fontsize=20)
plt.legend(loc=1)
plt.show()
$y_{t}=\beta_0+\beta_1 y_{t-1}+\beta_2\,\mathrm{factor}_1+\beta_3\,\mathrm{factor}_2+\beta_4\,\mathrm{factor}_3+\beta_5\,\mathrm{factor}_4$
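The regression above can be sketched with plain NumPy least squares on toy data (illustrative names; the true coefficients are made up for the demonstration):

```python
import numpy as np

rng = np.random.RandomState(1)
T = 500
factors = rng.standard_normal((T, 4))                      # four factor returns
beta_true = np.array([0.01, 0.4, 0.3, -0.2, 0.1, 0.05])    # [b0, b1 (lag), b2..b5]

# build y with an AR(1) term plus the four factor terms and small noise
y = np.zeros(T)
for t in range(1, T):
    y[t] = (beta_true[0] + beta_true[1]*y[t-1]
            + factors[t] @ beta_true[2:] + 0.01*rng.standard_normal())

# design matrix: intercept, lagged return, and the four factors
X = np.column_stack([np.ones(T-1), y[:-1], factors[1:]])
beta_hat, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
print(beta_hat)
```

With enough observations and little noise, the OLS estimate recovers the generating coefficients closely.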
def MADataExtract(stocksReturns, factorsReturns, lag):
    stockReturn1lag = list()
    for stock in stocksReturns:
        stockReturn1lag.append([np.average(stock[i-lag:i]) for i in range(lag, len(stocksReturns[0]))])
    newStockReturn = [stock[lag:] for stock in stocksReturns]
    newFactor = [factor[lag:] for factor in factorsReturns]
    return [newStockReturn, newFactor, stockReturn1lag]
fig, ax = plt.subplots(5,4,figsize=(20,10))
fig.suptitle("The change of first stock distribution",fontsize = 16)
for i in range(1,21):
    plt.subplot(5,4,i)
    [newStocksReturn,newFactors,stocksReturnlag] = MADataExtract(stocksReturns,factorsReturns,i*7)
    plotDistribution(stocksReturnlag[0])
    plt.xlabel('order={}'.format(i*7),fontsize=16)
plt.show()
*(Figure: side-by-side residual scatter plots contrasting homoskedasticity (left) and heteroscedasticity (right); images not preserved.)*
$Var(ϵ|x_i)=E[ϵ^2|x_i]−(E[ϵ|x_i])^2$
$\hat{u}^{2}=\gamma_{0}+\gamma_{1}x+v$
We can now define the null hypothesis $H_0$: $\gamma_1 = \gamma_2 = \dots = 0$, against the alternative hypothesis $H_1$: at least one $\gamma_i \ne 0$.
def getKColumnVar(factor_columns, nVarTest):
    return factor_columns[:,0:nVarTest]
def BreuschPaganTest(prediction, newStockReturn, factor_columns, nVarTest=0):
    residual = prediction-np.array(newStockReturn)
    residual_all = np.sum((prediction-np.array(newStockReturn))**2, axis=0)/len(newStockReturn[0])
    factor_column_all = getKColumnVar(factor_columns[0], nVarTest) if len(np.shape(factor_columns)) != 2 else factor_columns
    print('#'*100,
          '\nBreusch–Pagan test on residuals of the first 35 stocks with p-value=',
          tsd.het_breushpagan(residual_all, factor_column_all)[3],
          '\n'+'#'*100)
    # residual test on the residuals of each stock
    pvalueList = list()
    for i in range(1, len(residual)+1):
        # check whether every stock shares the same factor columns
        if (len(np.shape(factor_columns)) == 2):
            pvalueList.append(tsd.het_breushpagan(residual[i-1], factor_columns)[3])
        else:
            pvalueList.append(tsd.het_breushpagan(residual[i-1], factor_columns[i-1])[3])
        print('Breusch–Pagan test on residuals of stock {} with p-value={}'.format(i, pvalueList[-1]))
    return pvalueList
# MODEL WITH FIRST 35 STOCKS - 4 FACTOR AFTER FACTORIZE
[factor_columns,weights] = FindParamRegression(factorsReturns,stocksReturns)
weights = np.array(weights).T
prediction = np.dot(factor_columns,weights).T
pvalueStocks_o = BreuschPaganTest(prediction,stocksReturns,factor_columns)
fig =plt.figure(figsize=(15,4))
x = np.arange(1,len(stocksReturns)+1)
plt.bar(x,np.array(pvalueStocks_o),width=0.6,color='r', label="p-value",alpha=0.6)
plt.axhline(0.05, color='k', ls='--')
plt.text(1.65, 0.06, r'p-value = 0.05 line', fontsize=15)
plt.xlim(0,len(stocksReturns)+1)
plt.tight_layout()
plt.ylim(0,0.2)
plt.show()
lag = 1
# get data
[newStockReturn,newFactor,stockReturnKlag] = ARDataExtract(stocksReturns,factorsReturns,lag)
# find parameter
[factor_columns,weights] = FindParamAutoRegression(newFactor,newStockReturn,stockReturnKlag)
weights = np.array(weights)
# prediction
prediction = ARPrediction(weights,stockReturnKlag,factor_columns)
pvalueStocks = BreuschPaganTest(prediction,newStockReturn,factor_columns,13)
num_plots = 1
fig =plt.figure(figsize=(15,10))
colormap = plt.cm.gist_ncar
plt.gca().set_color_cycle([colormap(i) for i in np.linspace(0, 0.9, num_plots)])
x = np.arange(1,len(weights)+1)
# get prediction new
ax = fig.add_subplot(1,1,1)
ax.set_xticks(x)
ax2 = ax.twinx()
ax.set_xlabel("Stock ID")
ax.set_ylabel("Mean Square Error")
ax2.set_ylabel("p-value")
ax.set_ylim(-5,10)
ax2.set_ylim(0.1,-0.2)
plt.grid(True,which="both",ls="-")
plt.title('relation between p-value of heteroskedasticity and MSE')
for lag in range(1, num_plots+1):
    # get data
    [newStockReturn,newFactor,stockReturnKlag] = ARDataExtract(stocksReturns,factorsReturns,lag)
    # find parameters
    [factor_columns,weights] = FindParamAutoRegression(newFactor,newStockReturn,stockReturnKlag)
    weights = np.array(weights)
    # prediction
    prediction = ARPrediction(weights,stockReturnKlag,factor_columns)
    MSE_n = np.sum((prediction-np.array(newStockReturn))**2,axis=1)/len(newStockReturn[0])
    # plot
    ax.plot(x, MSE_n, '--o', linewidth=1, label="MSE - lag = {i}".format(i=lag), alpha=0.75)
    ax2.bar(x, np.array(pvalueStocks), width=0.6, color='r', label="p-value - lag = {i}".format(i=lag), alpha=0.6)
    ax2.axhline(0.05, color='k', ls='--')
    ax2.text(1.65, 0.06, r'p-value = 0.05 line', fontsize=15)
plt.xlim(0,len(weights))
lines, labels = ax.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(lines + lines2, labels + labels2, loc=0)
plt.tight_layout()
plt.show()
def readYahooHistory(fname):
    def process_line(line):
        cols = line.split(',')
        date = datetime.strptime(cols[0], '%d/%m/%Y')
        value = float(cols[1])
        return (date, value)
    with open(fname) as f:
        content_w_header = f.readlines()
    # remove the first (header) line
    # and reverse the lines to sort the data by date, in ascending order
    content = content_w_header[-1:0:-1]
    return list(map(process_line, content))
new_factors_folder= "marketFactor/"
new_factor_files = ['DJI.csv','N225.csv','RUT.csv','VIX.csv']
new_factor_files = map(lambda fn: new_factors_folder + fn, new_factor_files)
new_factors = [readYahooHistory(f) for f in new_factor_files]
new_allfactors = allfactors + new_factors
new_factors_ls = list(map( lambda factor:
fillInHistory_fix(trimToRegion(factor,start,end),start,end),
new_allfactors))
new_factorsReturns = list(map(twoWeekReturns, new_factors_ls))
plt.figure(figsize=(20,10))
plt.subplots_adjust(hspace=0)
ind = np.arange(len(new_factorsReturns[4]))
ax1 = plt.subplot(2,2,1)
data = np.array(new_factorsReturns[4])
lc1 = threshold_plot(ax1, ind, data, 0, 'r', 'g')
ax1.axhline(0, color='k', ls='--')
ax1.set_ylabel('DJI index factor',fontsize=24)
lc1.set_linewidth(2)
ax2 = plt.subplot(2,2,2,sharex=ax1)
data = np.array(new_factorsReturns[5])
lc2 = threshold_plot(ax2, ind, data, 0, 'r', 'g')
ax2.axhline(0, color='k', ls='--')
#ax2.tick_params(axis='y', colors='red',labelsize=15)
lc2.set_linewidth(2)
ax2.set_ylabel('N225 index factor',fontsize=24)
ax3 = plt.subplot(2,2,3,sharex=ax1)
data = np.array(new_factorsReturns[6])
lc3 = threshold_plot(ax3, ind, data, 0, 'r', 'g')
ax3.axhline(0, color='k', ls='--')
lc3.set_linewidth(2)
ax3.set_ylabel('RUT index factor',fontsize=24)
ax4 = plt.subplot(2,2,4,sharex=ax1)
data = np.array(new_factorsReturns[7])
lc4 = threshold_plot(ax4, ind, data, 0, 'r', 'g')
ax4.axhline(0, color='k', ls='--')
lc4.set_linewidth(2)
ax4.set_ylabel('VIX index factor',fontsize=24)
ax4.tick_params(axis='y', colors='red',labelsize=15)
xticklabels = ax1.get_xticklabels() + ax2.get_xticklabels()
plt.setp(xticklabels, visible=False)
plt.show()
plt.figure(figsize=(20,15))
plt.subplot(2,2,1)
plt.title('DJI',fontsize=20)
plotDistribution(new_factorsReturns[4])
plt.subplot(2,2,2)
plt.title('N225',fontsize=20)
plotDistribution(new_factorsReturns[5])
plt.subplot(2,2,3)
plt.title('RUT',fontsize=20)
plotDistribution(new_factorsReturns[6])
plt.subplot(2,2,4)
plt.title('VIX',fontsize=20)
plotDistribution(new_factorsReturns[7])
plt.show()
np.set_printoptions(precision=2)
correlation = np.corrcoef(new_factorsReturns)
print("\tOil\tBond\tGSPC\tIXIC\tDJI\tN225\tRUT\tVIX")
correlation
CompareMSEbyFeature(stocksReturns,("4 factors",factorsReturns),("8 factors",new_factorsReturns))
[newTrials,newValueAtRisk] = MonteCarloGeneralEstimate(stocksReturns,new_factorsReturns,150000,multivariate_normal)
HyTestMonteCarlo(newTrials,stocksReturns,newValueAtRisk)
VisualizeAfterModel(newTrials.collect(),stocksReturns) #.collect()
pp_y = sm.ProbPlot(np.array(ReturnOfAllStock(stocksReturns)),fit=True)
pp_x = sm.ProbPlot(np.array(newTrials.collect())[0:len(np.array(ReturnOfAllStock(stocksReturns)))],fit=True)
fig = pp_x.qqplot(line='45',other=pp_y)
plt.show()
allStocksReturn = ReturnOfAllStock(stocksReturns)
print("Stocks Return from real data vs from our simulation: p-value=",
stats.ks_2samp(newTrials.collect()
,allStocksReturn)[1])
[newTrials,newValueAtRisk] = MonteCarloGeneralEstimate(stocksReturns,new_factorsReturns,150000,multivariate_t_rvs,9)
HyTestMonteCarlo(newTrials,stocksReturns,newValueAtRisk)
VisualizeAfterModel(newTrials.collect(),stocksReturns)
import warnings
warnings.filterwarnings('ignore')
from sklearn.linear_model import Lasso
from sklearn import preprocessing
#A helper method for pretty-printing linear models
def pretty_print_linear(coefs, names=None, sort=False):
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        lst = sorted(lst, key=lambda x: -np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name)
                      for coef, name in lst)
# standardize the data to apply the lasso
feature = transpose(featurize(new_factorsReturns))
X = preprocessing.scale(feature)
Y = stocksReturns[0]
featureName = ['Oil','Bond', 'GSPC','IXIC','DIJ','N225','RUT','VIX']
names = ['square' + s for s in featureName]+['root'+ s for s in featureName]+featureName
coefAlphaLasso = list()
alphaList = [0.00001,0.0001,0.001,0.01,0.1]
for alpha in alphaList:
    # run the lasso fitting
    lasso = Lasso(alpha=alpha)
    lasso.fit(X, Y)
    coefAlphaLasso.append(lasso.coef_.tolist())
featuresDF = pd.DataFrame(np.transpose(coefAlphaLasso),columns=['1e-5','1e-4','1e-3','1e-2','1e-1'],index=names)
featuresDF[(featuresDF<0.01) & (featuresDF>-0.01)] = -float('Inf')
# DataFrame.sort was removed in newer pandas; sort_values replaces it
featuresDF.sort_values(by=['1e-1','1e-2','1e-3','1e-4','1e-5'], ascending=False)
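The alpha sweep above relies on a key property of the lasso: as the L1 penalty grows, the coefficients of uninformative features are driven exactly to zero, which is what makes it usable for feature selection. A minimal sketch on synthetic data (all names here are hypothetical, not part of the notebook's pipeline):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn import preprocessing

# Synthetic data: y depends only on the first two of five features,
# so a strong enough L1 penalty should zero out the irrelevant ones.
rng = np.random.RandomState(0)
X = preprocessing.scale(rng.randn(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.01 * rng.randn(200)

for alpha in (0.001, 0.1, 1.0):
    lasso = Lasso(alpha=alpha)
    lasso.fit(X, y)
    n_zero = np.sum(lasso.coef_ == 0)
    print("alpha=%g -> %d of 5 coefficients exactly zero" % (alpha, n_zero))
```

With a large alpha the three noise features get exactly-zero coefficients while the two informative ones merely shrink, mirroring how the table above singles out a few factor features per alpha.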
def simulateTrialReturnsCustomFeature(numTrials, factorMeans, factorCov, weights, customFeatureList, multivariate_func, *arg):
    trialReturns = []
    for i in range(0, numTrials):
        # generate a sample of the factors' returns
        [trialFactorReturns] = multivariate_func(factorMeans, factorCov, 1, *arg)
        # featurize the factors' returns
        trialFeatures = featurizeCustom(list(trialFactorReturns), customFeatureList)
        # insert the weight for the intercept term
        trialFeatures.insert(0, 1)
        # calculate the return of each instrument
        trialTotalReturn = np.dot(weights, trialFeatures)
        # then calculate the total return for this trial's features
        trialTotalReturn = sum(trialTotalReturn)
        trialReturns.append(trialTotalReturn)
    return trialReturns
def featurizeCustom(factorRow, customFeatureList):
    featureReturn = list()
    for item in customFeatureList:
        idx = featureName.index(item[0])
        if item[1][0] == 1:
            featureReturn.append(factorRow[idx])
        if item[1][1] == 1:
            featureReturn.append(np.power(factorRow[idx], 2) * np.sign(factorRow[idx]))
        if item[1][2] == 1:
            featureReturn.append(np.sqrt(np.abs(factorRow[idx])) * np.sign(factorRow[idx]))
    return featureReturn
def FactorParameterCustomFeature(stocksReturns, factorsReturns, customFeatureList):
    # Cov and Mean
    factorCov = np.cov(factorsReturns)
    factorMeans = [sum(x)/len(x) for x in factorsReturns]
    # transpose factorsReturns
    factorMat = transpose(factorsReturns)
    # featurize each row of factorMat
    factorFeatures = list(map(lambda x: featurizeCustom(x, customFeatureList), factorMat))
    # OLS requires a numpy array
    factor_columns = np.array(factorFeatures)
    # add a constant - the intercept term for each instrument i.
    factor_columns = sm.add_constant(factor_columns, prepend=True)
    # estimate the new weights of the linear regression model
    weights = [estimateParams(stockReturns, factor_columns) for stockReturns in stocksReturns]
    return [factorMeans, factorCov, weights, factor_columns]
# OFFLINE VERSION
def MonteCarloGeneralEstimateCustomFeature(newStocksReturns, factorsReturns, customFeatureList, numTrials, multivariate_func, *arg):
    # estimate the new weights of the linear regression model
    param = FactorParameterCustomFeature(newStocksReturns, factorsReturns, customFeatureList)
    # Spark parallelism
    parallelism = 12
    trial_indexes = list(range(0, parallelism))
    seedRDD = sc.parallelize(trial_indexes, parallelism)
    bNewFactorWeights = sc.broadcast(param[2])
    # Simulation
    newTrials = seedRDD.flatMap(lambda idx: simulateTrialReturnsCustomFeature(
        max(int(numTrials / parallelism), 1),
        param[0], param[1],
        bNewFactorWeights.value, customFeatureList, multivariate_func, *arg))
    newTrials.cache()
    newValueAtRisk = fivePercentVaR(newTrials)
    newConditionalValueAtRisk = fivePercentCVaR(newTrials)
    print("New Value at Risk (VaR) 5%:", newValueAtRisk)
    print("New Conditional Value at Risk (CVaR) 5%:", newConditionalValueAtRisk)
    return [newTrials, newValueAtRisk]
#newTrials = simulateTrialReturnsCustomFeature(
# numTrials,
# param[0], param[1],
# param[2],customFeatureList,multivariate_func,*arg)
#newValueAtRisk = fivePercentVaROff(newTrials)
!pip install --upgrade scikit-learn
from sklearn.model_selection import KFold
def trainingOLS(stocksReturns, factorsReturns, selectedFeature):
    ## Train
    [a, b, weights, factor_columns] = FactorParameterCustomFeature(
        stocksReturns, factorsReturns, selectedFeature)
    weights = np.array(weights).T
    return weights
def calculateMSE(stocksReturns, factorsReturns, selectedFeature, weights):
    ## Test
    # featurize the test factors
    factorMat = transpose(factorsReturns)
    factorFeatures = list(map(lambda x: featurizeCustom(x, selectedFeature), factorMat))
    factor_columns = np.array(factorFeatures)
    factor_columns = sm.add_constant(factor_columns, prepend=True)
    # prediction
    prediction = np.dot(factor_columns, weights).T
    # MSE
    MSE = np.sum((prediction - np.array(stocksReturns))**2,
                 axis=1) / len(stocksReturns[0])
    return np.sum(MSE) / len(MSE)
def getDatafromIndex(data, idx):
    return [list(map(x.__getitem__, idx)) for x in data]
def crossValidateLinearModel(stocksReturns, factorsReturns, selectedFeature, n_kfold):
    kf = KFold(n_splits=n_kfold)
    i = 1
    for trainIdx, testIdx in kf.split(stocksReturns[0]):
        foldTrainStocksReturns = getDatafromIndex(stocksReturns, trainIdx)
        foldTrainFactorsReturn = getDatafromIndex(factorsReturns, trainIdx)
        foldTestStocksReturns = getDatafromIndex(stocksReturns, testIdx)
        foldTestFactorsReturn = getDatafromIndex(factorsReturns, testIdx)
        weights = trainingOLS(foldTrainStocksReturns, foldTrainFactorsReturn, selectedFeature)
        print("%d-fold: MSE=" % (i), calculateMSE(foldTestStocksReturns, foldTestFactorsReturn, selectedFeature, weights))
        i += 1
selectedFeature = [('Oil',[1,1,0]),('Bond',[1,0,0]),('IXIC',[1,1,1])]
crossValidateLinearModel(stocksReturns,new_factorsReturns,selectedFeature,20);

def crossValidateRollingBasisLinearModel(stocksReturns, factorsReturns, selectedFeature, predictRange, kTest):
    total = 0
    nObs = len(stocksReturns[0])
    for i in np.arange(kTest)[::-1]:
        # train-test indexes
        trainIdx = np.arange(0, nObs - (i+1)*predictRange)
        testIdx = np.arange(nObs - (i+1)*predictRange, nObs)
        # get the train-test data
        foldTrainStocksReturns = getDatafromIndex(stocksReturns, trainIdx)
        foldTrainFactorsReturn = getDatafromIndex(factorsReturns, trainIdx)
        foldTestStocksReturns = getDatafromIndex(stocksReturns, testIdx)
        foldTestFactorsReturn = getDatafromIndex(factorsReturns, testIdx)
        # training and testing
        weights = trainingOLS(foldTrainStocksReturns, foldTrainFactorsReturn, selectedFeature)
        total += calculateMSE(foldTestStocksReturns, foldTestFactorsReturn, selectedFeature, weights)
    return total / kTest
selectedFeature = [('Oil',[1,1,0]),('Bond',[1,0,0]),('IXIC',[1,1,1])]
crossValidateRollingBasisLinearModel(stocksReturns,new_factorsReturns,selectedFeature,10,10)
selectedFeature = [('Oil',[1,1,1]),('Bond',[1,1,1]),('GSPC',[1,1,1]),('IXIC',[1,1,1])
,('DIJ',[1,1,1]),('N225',[1,1,1]),('RUT',[1,1,1]),('VIX',[1,1,1])]
print('Mean Square Error of Linear Regression:',
crossValidateRollingBasisLinearModel(stocksReturns,new_factorsReturns,selectedFeature,10,10))
[newTrials,newValueAtRisk] = MonteCarloGeneralEstimateCustomFeature(stocksReturns,new_factorsReturns,
selectedFeature,200000,multivariate_t_rvs,9.5)
HyTestMonteCarlo(newTrials,stocksReturns,newValueAtRisk)
selectedFeature = [('Oil',[0,0,0]),('Bond',[0,0,0]),('GSPC',[0,0,0]),('IXIC',[0,0,0])
,('DIJ',[0,0,0]),('N225',[0,0,1]),('RUT',[1,0,0]),('VIX',[0,0,0])]
print('Mean Square Error of Linear Regression:',
crossValidateRollingBasisLinearModel(stocksReturns,new_factorsReturns,selectedFeature,10,10))
[newTrials,newValueAtRisk] = MonteCarloGeneralEstimateCustomFeature(stocksReturns,new_factorsReturns,
selectedFeature,200000,multivariate_t_rvs,6.5)
HyTestMonteCarlo(newTrials,stocksReturns,newValueAtRisk)
selectedFeature = [('Oil',[0,1,0]),('Bond',[1,1,0]),('GSPC',[1,1,1]),('IXIC',[0,1,1])
,('DIJ',[1,1,1]),('N225',[1,1,0]),('RUT',[1,1,1]),('VIX',[0,1,1])]
print('Mean Square Error of Linear Regression:',
crossValidateRollingBasisLinearModel(stocksReturns,new_factorsReturns,selectedFeature,10,10))
[newTrials,newValueAtRisk] = MonteCarloGeneralEstimateCustomFeature(stocksReturns,new_factorsReturns,
selectedFeature,200000,multivariate_t_rvs,7.3)
HyTestMonteCarlo(newTrials,stocksReturns,newValueAtRisk)

The idea is to apply a smooth, invertible transformation to some univariate data so that the distribution of the transformed data is as close to Gaussian as possible. There are several reasons to want this:
- A standard pre-processing step is to "whiten" data by subtracting the mean and scaling it to unit standard deviation, which is most meaningful when the data is roughly Gaussian.
- Robust statistics: the transformation reduces the effect of outliers, and a lot of real-world data exhibits long tails.
- Theoretical benefits: Gaussian distributions are very well studied, with many unique properties that one may want to leverage.
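A minimal sketch of the shift-then-Box-Cox recipe on synthetic fat-tailed data (all names hypothetical); it uses `scipy.special.inv_boxcox` for the inversion, an alternative to the notebook's own `invboxcox` helper:

```python
import numpy as np
from scipy import stats, special

rng = np.random.RandomState(42)
returns = rng.standard_t(df=5, size=500) * 0.01   # fat-tailed "returns"

# Box-Cox needs strictly positive data, so shift by (min - 1).
shift = np.min(returns) - 1
x = returns - shift

# Fit the transform; `ld` is the estimated lambda.
xt, ld = stats.boxcox(x)

# The transform is invertible: undo it and undo the shift.
recovered = special.inv_boxcox(xt, ld) + shift
print(np.allclose(recovered, returns))
```

Keeping `shift` and `ld` around is essential: they are exactly what the code below stores in `posi` and `lds` so that simulated Gaussian samples can be mapped back to the original return scale.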
fig = plt.figure(figsize=(20,6))
ax1 = fig.add_subplot(121)
x = np.array(factorsReturns[0]) - (np.min(factorsReturns[0])-1)
prob = stats.probplot(x, dist=stats.norm, plot=ax1)
ax1.set_xlabel('')
ax1.set_title('Probplot against normal distribution')
ax2 = fig.add_subplot(122)
xt, _ = stats.boxcox(x)
prob = stats.probplot(xt, dist=stats.norm, plot=ax2)
ax2.set_title('Probplot after Box-Cox transformation')
fig.suptitle('GSPC factor',fontsize = 20)
def invboxcox(y, ld):
    if ld == 0:
        return np.exp(y)
    else:
        return np.power(ld*y + 1, 1/ld)
factorsReturnsNormalize = list()
lds = list()
posi = list()
for i in range(0, len(factorsReturns)):
    # keep the min value to reverse the shift later
    posi.append(np.min(factorsReturns[i]) - 1)
    # transform
    x = np.array(factorsReturns[i]) - posi[-1]
    [xt, ld] = stats.boxcox(x)
    lds.append(ld)
    factorsReturnsNormalize.append(xt)
KSresult = list()
KSresult.append(["Oil",KStestNormal(factorsReturns[0])[1], KStestNormal(factorsReturnsNormalize[0])[1]])
KSresult.append(["Treasury",KStestNormal(factorsReturns[1])[1], KStestNormal(factorsReturnsNormalize[1])[1]])
KSresult.append(["GSPC",KStestNormal(factorsReturns[2])[1], KStestNormal(factorsReturnsNormalize[2])[1]])
KSresult.append(["IXIC",KStestNormal(factorsReturns[3])[1], KStestNormal(factorsReturnsNormalize[3])[1]])
pd.DataFrame(KSresult,columns=['factor','before normalize','after normalize'])
print('Before transformation:', factorsReturns[0][0])
print('After transformation:', factorsReturnsNormalize[0][0])
print('Reverse the transformation:',invboxcox(factorsReturnsNormalize[0][0],lds[0])+posi[0])
plt.figure(figsize=(20,15))
plt.subplot(2,2,1)
plt.title('Crude Oil',fontsize=20)
plotDistribution(factorsReturnsNormalize[0])
plt.subplot(2,2,2)
plt.title('Treasury Bond',fontsize=20)
plotDistribution(factorsReturnsNormalize[1])
plt.subplot(2,2,3)
plt.title('GSPC',fontsize=20)
plotDistribution(factorsReturnsNormalize[2])
plt.subplot(2,2,4)
plt.title('IXIC',fontsize=20)
plotDistribution(factorsReturnsNormalize[3])
plt.show()
def multivariate_transformednormal_rvs(mean, S, n, lds, posi):
    samples = np.random.multivariate_normal(mean, S, (n,))[0]
    while True:
        reverse_sample = list()
        for idx, sample in enumerate(samples):
            temp = invboxcox(sample, lds[idx]) + posi[idx]
            reverse_sample.append(temp)
        # resample if the inverse transform produced a NaN
        if len(np.argwhere(np.isnan(reverse_sample))) == 0:
            break
        samples = np.random.multivariate_normal(mean, S, (1,))[0]
    return [reverse_sample]
def fivePercentVaROff(newTrials):
    newTrials.sort()
    return newTrials[round(len(newTrials)/20)]
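`fivePercentVaROff` simply reads the value at the 5th-percentile position of the sorted trial returns. A quick sanity check against numpy's percentile on synthetic trials (all names hypothetical); the two agree only approximately, since `np.percentile` interpolates between order statistics:

```python
import numpy as np

def fivePercentVaROff(newTrials):
    newTrials = sorted(newTrials)
    return newTrials[round(len(newTrials) / 20)]

rng = np.random.RandomState(0)
trials = rng.normal(0.0, 0.02, size=10000).tolist()

var5 = fivePercentVaROff(trials)
print(var5, np.percentile(trials, 5))
```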
def MonteCarloNormalizedEstimate(newStocksReturns, factorsReturns, factorsReturnsNormalize, numTrials, multivariate_func, *arg):
    # estimate the new weights of the linear regression model
    param = FactorParameter(newStocksReturns, factorsReturns)
    param1 = FactorParameter(newStocksReturns, factorsReturnsNormalize)
    # run in parallel with Spark
    parallelism = 12
    trial_indexes = list(range(0, parallelism))
    seedRDD = sc.parallelize(trial_indexes, parallelism)
    bNewFactorWeights = sc.broadcast(param[2])
    newTrials = seedRDD.flatMap(lambda idx: simulateTrialReturns(
        max(int(numTrials / parallelism), 1),
        param1[0], param1[1],
        bNewFactorWeights.value, multivariate_func, *arg))
    newTrials.cache()
    newValueAtRisk = fivePercentVaR(newTrials)
    print("New Value at Risk (VaR) 5%:", newValueAtRisk)
    return [newTrials, newValueAtRisk]
[newTrials,newValueAtRisk] = MonteCarloNormalizedEstimate(stocksReturns,factorsReturns,factorsReturnsNormalize,
300000,multivariate_transformednormal_rvs,lds,posi)
print("Kupiec test p-value: " , kupiecTestPValue(stocksReturns, newValueAtRisk, 0.05))
def multivariate_normal(factorMeans, factorCov, n, *arg):
    return [np.random.multivariate_normal(factorMeans, factorCov)]
[newTrials,newValueAtRisk] = MonteCarloGeneralEstimate(stocksReturns,factorsReturns,200000,multivariate_normal)
print("Kupiec test p-value: " , kupiecTestPValue(stocksReturns, newValueAtRisk, 0.05))
stddev = 10; stddev2 = 5
mu = 10; mu2 = 30
domain = np.arange(mu - 50, mu + 50)
# mlab.normpdf was removed from matplotlib; scipy's normal pdf replaces it
y1 = stats.norm.pdf(domain, mu, stddev)
y2 = stats.norm.pdf(domain, mu2, stddev2)
# plot
plt.figure(figsize=(20,6))
plt.plot(domain,y1,color='b',alpha = 0.8,lw=3, label = 'distribution 1')
plt.plot(domain,y2,color='r',alpha = 0.8,lw=3, label = 'distribution 2')
plt.plot(domain, y1+y2, 'k--', alpha=1.0, lw=5, label='sum of the two Gaussians')
plt.legend(loc=2)
plt.plot()
**$f(y+\Delta x)=f(y)+f'(y)\,\Delta x+\epsilon \quad (1)$**
**$f(y+\Delta x)=f(y)+f'(y)\,\Delta x+\tfrac{1}{2}f''(y)\,\Delta x^2+\epsilon \quad (2)$**
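These first- and second-order Taylor expansions are what motivate adding the first- and second-order return terms below. They can be checked numerically; a small sketch with f = exp, chosen because all of its derivatives equal f itself:

```python
import numpy as np

# First- and second-order Taylor approximations of f around y,
# for f = exp (so f' = f'' = exp).
f = np.exp
y, dx = 0.5, 0.01

first  = f(y) + f(y) * dx
second = f(y) + f(y) * dx + 0.5 * f(y) * dx**2

err1 = abs(f(y + dx) - first)
err2 = abs(f(y + dx) - second)
print(err1, err2)   # the second-order error is much smaller
```

The quadratic term shrinks the approximation error by roughly a factor of dx, which is why a second-order ("gamma") term can noticeably improve a purely linear ("delta") model.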
# get data
[stocksReturnsDG,factorsReturnsDG,stocksReturns1lag] = ARDataExtract(stocksReturns,factorsReturns,1)
stocksReturns1lag = stocksReturns1lag[0]
stocksReturnsDG_1order = np.array(stocksReturnsDG)-np.array(stocksReturns1lag)
stocksReturnsDG_2order = np.square(stocksReturnsDG_1order)
def plotDistribution(samples):
    vmin = min(samples)
    vmax = max(samples)
    stddev = np.std(samples)
    mu = np.mean(samples)
    domain = np.arange(vmin, vmax, (vmax - vmin) / 100)
    # a simple heuristic to select the bandwidth
    bandwidth = 1.06 * stddev * pow(len(samples), -.2)
    # estimate the density
    kde = KDEUnivariate(samples)
    kde.fit(bw=bandwidth)
    density = kde.evaluate(domain)
    # plot ("normed" was removed from matplotlib; "density" replaces it)
    plt.hist(samples, density=True, bins=100, color='b', alpha=0.5, label='Real data')
    plt.plot(domain, density, lw=3, color='r', alpha=0.85, label='KDE estimation')
    plt.legend()
numStock = np.arange(0,5)
nCols = 3
nRows = len(numStock)
f , axarr = plt.subplots(nRows, nCols,figsize=(20,30))
f.suptitle("Compare distribution of real data - 1st order - 2nd order of 5 first stock", fontsize = 25)
for idx in numStock:
    plt.axes(axarr[idx, 0])
    plotDistribution(stocksReturnsDG[idx])
    plt.ylabel('StockID:{}'.format(idx), fontsize=20)
    plt.title('Return', fontsize=20)
    plt.axes(axarr[idx, 1])
    plotDistribution(stocksReturnsDG_1order[idx])
    plt.title('1st order return', fontsize=20)
    plt.axes(axarr[idx, 2])
    plotDistribution(stocksReturnsDG_2order[idx])
    plt.title('2nd order return', fontsize=20)
plt.show()
def FindParamDeltaGammaModel(stocksReturns, factorsReturns, stockReturns_1order, stockReturns_2order):
    factorMat = transpose(factorsReturns)
    # featurize each row of factorMat
    factorFeatures = list(map(featurize, factorMat))
    # OLS requires a numpy array
    factor_columns = np.array(factorFeatures)
    # add a constant - the intercept term for each instrument i.
    factor_columns = sm.add_constant(factor_columns, prepend=True).T.tolist()
    # estimate the new weights of the linear regression model
    new_factor_columns = list()
    weights = list()
    for ind, stock in enumerate(stocksReturns):
        extra_1order = stockReturns_1order[ind]
        extra_2order = stockReturns_2order[ind]
        feature_vector = np.asanyarray(factor_columns + [extra_1order, extra_2order]).T
        new_factor_columns.append(feature_vector)
        # fit with the regression model
        weights.append(estimateParams(stock, feature_vector))
    # Cov and Mean
    factorCov = np.cov(factorsReturns)
    factorMeans = [sum(x)/len(x) for x in factorsReturns]
    return [np.asanyarray(new_factor_columns), weights, factorMeans, factorCov]
def MSEwithGammaDeltaModel(factorsReturns, stocksReturns):
    # get the prediction of the original linear model
    [factor_columns, weights] = FindParamRegression(factorsReturns, stocksReturns)
    weights = np.array(weights).T
    prediction = np.dot(factor_columns, weights).T
    # get the prediction of the delta-gamma model
    [newStocksReturns, newFactor, stockReturn1lag] = ARDataExtract(stocksReturns, factorsReturns, 1)
    stockReturn1lag = stockReturn1lag[0]
    newStocksReturn1order = np.array(newStocksReturns) - np.array(stockReturn1lag)
    newStocksReturn2order = np.square(newStocksReturn1order)
    [factor_columns, weights, a, b] = FindParamDeltaGammaModel(newStocksReturns, newFactor, newStocksReturn1order, newStocksReturn2order)
    weights = np.array(weights)
    # predict for each stock
    prediction_n = list()
    for ind, x in enumerate(newStocksReturns):
        prediction_n.append(np.dot(factor_columns[ind], weights[ind]).T.tolist())
    # compute MSE
    MSE = np.sum((prediction - np.array(stocksReturns))**2, axis=1) / len(stocksReturns[0])
    MSE_n = np.sum((prediction_n - np.array(newStocksReturns))**2, axis=1) / len(newStocksReturns[0])
    x = np.arange(1, len(MSE)+1)
    # plot the data
    plt.figure(figsize=(15, 6))
    plt.plot(x, MSE - MSE_n, 'b-o', linewidth=1, label="Difference between Blue and Red MSE", alpha=0.75)
    plt.bar(x, MSE, 0.8, color='c', label="MSE - Origin", alpha=1, align='center')
    plt.bar(x, MSE_n, 0.8, color='r', label="MSE - Gamma-Delta", alpha=0.65, align='center')
    plt.xlim(0, len(weights)+1)
    plt.xlabel("Stock ID")
    plt.ylabel("Mean Square Error")
    plt.legend(loc=0, fontsize=16)
    plt.show()
MSEwithGammaDeltaModel(factorsReturns,stocksReturns);
def getRandomRawStockData(size):
    # select the paths of all stock data files in "stock_folder"
    files = [join(stock_folder, f) for f in listdir(stock_folder) if isfile(join(stock_folder, f))]
    # randomly sample `size` stocks
    files = random.sample(files, len(files))
    files = files[:size]
    # read each line in each file, convert it into the format: (date, value)
    rawStocks = [process_stock_file(f) for f in files]
    # keep only instruments with more than number_of_years years of history
    number_of_years = 7
    rawStocks = list(filter(lambda instrument: ((instrument[-1][0] - instrument[0][0]).days / 365) > number_of_years,
                            rawStocks))
    return rawStocks
newrawStocks = getRandomRawStockData(100)
newStocks = list(map(lambda newStock: fillInHistory_fix(trimToRegion(newStock, start, end)
,start, end), newrawStocks))
newStocksReturns = list(map(twoWeekReturns, newStocks))
newLowRiskStocksReturns = RemoveRiskStock(newStocksReturns,-80,100,8)
print("the number of stocks, including risky stocks, is", len(newStocksReturns))
print("the number of stocks we invest in now is", len(newLowRiskStocksReturns))
MSEwithGammaDeltaModel(factorsReturns,newLowRiskStocksReturns)
In this lecture, we studied the Monte Carlo simulation method and its application to estimating financial risk. To apply it, we first needed to define the relationship between the market factors and the instruments' returns. In this step, you must define the model that maps the market factors' values to the instruments' values: in our use case, we built our model with a linear regression function. Next, we had to find the parameters of our model, which are the weights of the factors we considered. Then, we had to study the distribution of each market factor. A good way to do that is to use kernel density estimation to smooth each distribution and plot it. Depending on the shape of each figure, we had to guess the best-fitting distribution for each factor: in our use case, we took a very simple approach and decided that our smoothed distributions all looked like normal distributions.
Then, the idea of Monte Carlo simulation was to generate many possible values for each factor and, in each trial, calculate the corresponding outcome with the well-defined model. After many trials, we were able to calculate the VaR from the sequence of outcome values. When the number of trials is large enough, the VaR converges to reasonable values, which we could validate using well-known statistical hypothesis tests.
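The whole pipeline summarized above can be sketched end to end on synthetic data. This is a minimal, serial sketch under assumed synthetic inputs, not the notebook's Spark implementation; every name in it is hypothetical:

```python
import numpy as np

rng = np.random.RandomState(1)

# Synthetic history: 2 factors, 3 instruments linearly driven by them.
nObs = 500
factors = rng.multivariate_normal([0, 0], [[1e-4, 5e-5], [5e-5, 2e-4]], nObs)
trueW = np.array([[0.8, 0.1], [0.2, 0.9], [0.5, 0.5]])
stocks = factors @ trueW.T + 1e-3 * rng.randn(nObs, 3)

# Step 1: fit the factor model by least squares (one weight column per stock).
X = np.column_stack([np.ones(nObs), factors])
weights, *_ = np.linalg.lstsq(X, stocks, rcond=None)

# Step 2: fit the factor distribution (here: multivariate normal).
mu, cov = factors.mean(axis=0), np.cov(factors.T)

# Step 3: simulate many trials and compute each trial's portfolio return.
trials = rng.multivariate_normal(mu, cov, 100000)
Xt = np.column_stack([np.ones(len(trials)), trials])
portfolio = (Xt @ weights).sum(axis=1)

# Step 4: the 5% VaR is the 5th percentile of simulated portfolio returns.
var5 = np.percentile(portfolio, 5)
print("5%% VaR: %.4f" % var5)
```

Each step maps onto a piece of the notebook: `estimateParams` plays the role of the least-squares fit, `FactorParameter` the distribution fit, and `simulateTrialReturns` plus `fivePercentVaR` the last two steps.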